problem using intrinsic sqrt with FPU programming

I'm just fiddling around with FPU programming. I've tried this code with and without the 'generate intrinsic functions' setting in the Visual C++ 6 compiler. The code does not work properly for some reason, and I'm not sure why. This is my first day of FPU stuff. I'm using Tricks of the 3D Game Programming Gurus (LaMothe), pages 456+, as a resource.
void Vector3::Normalize()
{
#if 1
    float basic = x*x + y*y + z*z;

    if(basic>.0001)
    {
        _asm
        {
            FLD   basic; //Put basic length into st(0)
            FSQRT;       //Put SQUARE ROOT of basic into st(0)
            FST   basic; //Store square root of basic into memory address of basic
        }
        // if(basic) //won't ever be exactly zero

        //Can't seem to access this->x,y,or z
        float x1 = x;
        float y1 = y;
        float z1 = z;
        _asm
        {
            FLD   x1;    //Load into FPU stack element zero
            FDIV  basic; //divide by length
            FSTP  x1;    //store in x1 and pop stack

            FLD   y1;
            FDIV  basic;
            FSTP  y1;

            FLD   z1;
            FDIV  basic;
            FSTP  z1;
        }
        x = x1;
        y = y1;
        z = z1;
    }
#else
    float magnitude = sqrt( (x * x) + (y * y) + (z * z) );
    if(magnitude)
    {
        x /= magnitude;
        y /= magnitude;
        z /= magnitude;
    }
#endif
}
EDIT: Okay, I really don't know what I'm doing. I changed this line of code:
FST basic; //Store square root of basic into memory address of basic
to this:
FSTP basic; //Store square root of basic into memory address of basic
The only difference is the P, which denotes that it pops the stack. I'm not really sure why that is significant.
[edited by - Shadow12345 on December 15, 2003 3:59:26 PM]

You gave the answer yourself. FSTP pops the stack, FST does not. If you don't pop, your function returns with the FPU stack non-empty. The x87 stack only has eight registers, so execute your routine at most 7 more times (in fact fewer here) and you are sure to generate a stack overflow error. Beyond that, there is pointless, inefficient and even dangerous stuff in your code.

- x1=x; y1=y; z1=z; is pointless. x, y, z are already in memory, so there is no need to copy them onto the (esp) stack.

- Intermediate fstp should be absolutely avoided. If you code in asm, just keep variables in registers (st(x)). First, fstp adds latency and disrupts instruction parallelism. Next, it's a clear immediate store-then-load dependency, which is very damaging. And last but not least, it's totally unnecessary.

- Your asm code does three fdivs!!! Don't you know how slow these instructions are? Even your C version will be optimized, because the compiler will create a temp variable tmp=1/magnitude and do three muls instead of three divs. So either choose to learn asm and read the docs seriously, or keep coding C and your code will be faster. One has to begin though, and you have an altruistic teacher here. :)

The code should simply be this. I bet it's at least 3 times faster:
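__asm{
fld x
fmul st, st(0)   ; x*x (a bare fmul would pop and underflow here)
fld y
fmul st, st(0)   ; y*y, x*x
fld z
fmul st, st(0)   ; z*z, y*y, x*x
fxch st(2)       ; x*x, y*y, z*z
faddp st(1), st  ; x*x+y*y, z*z
faddp st(1), st  ; x*x+y*y+z*z
fsqrt            ; len
fld1             ; 1, len
fdivrp st(1), st ; k = 1/len (reverse divide; a plain fdiv gives len/1)
fld x
fmul st, st(1) ; x, k
fld y
fmul st, st(2) ; y, x, k
fxch st(1) ; x, y, k
fld z ; z, x, y, k
fxch st(3) ; k, x, y, z
fmulp st(3), st ; x, y, z (all scaled by k)
fstp x
fstp y
fstp z ; FPU stack is empty again
}
//Done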

Try to see what's on the stack at each step. Your function must end with an empty FPU stack since you return no float. That's the problem you had. This will be your homework. I don't need it, I wrote this code in 30 seconds (I am trained in asm, any kind of). Anyway, I would never use such code myself. Try googling quick reciprocal sqrt (rsqrt) approximations, because fsqrt followed by fdiv is a damn slow sequence. What's more, your function is too small to parallelize instructions. So don't use asm here.
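
To make the stack bookkeeping concrete, here is a minimal sketch in portable C++ that models only the depth of the eight-register x87 stack. ToyFpuStack, LeakyLength, CleanLength and CallsUntilOverflow are hypothetical names made up for illustration; the model ignores the real tag word and exception flags.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Toy model of the x87 register stack: eight slots, st(0) is the top.
// Hypothetical helper for illustration; it tracks depth only.
struct ToyFpuStack {
    std::array<float, 8> regs{};
    int depth = 0;            // number of occupied registers
    bool overflowed = false;  // set when a load finds no free register

    void fld(float v) {
        if (depth == 8) { overflowed = true; return; }
        regs[depth++] = v;
    }
    void fsqrt() { assert(depth > 0); regs[depth - 1] = std::sqrt(regs[depth - 1]); }
    float fst() const { assert(depth > 0); return regs[depth - 1]; }  // store, keep st(0)
    float fstp() { assert(depth > 0); return regs[--depth]; }         // store, then pop
};

// Mirrors the original routine: FLD / FSQRT / FST leaves one value behind.
inline float LeakyLength(ToyFpuStack& fpu, float lengthSq) {
    fpu.fld(lengthSq);
    fpu.fsqrt();
    return fpu.fst();
}

// Mirrors the fix: FSTP pops, so the stack is empty again on return.
inline float CleanLength(ToyFpuStack& fpu, float lengthSq) {
    fpu.fld(lengthSq);
    fpu.fsqrt();
    return fpu.fstp();
}

// Counts how many calls complete before a load overflows the stack.
inline int CallsUntilOverflow(float (*routine)(ToyFpuStack&, float), int maxCalls) {
    ToyFpuStack fpu;
    for (int call = 1; call <= maxCalls; ++call) {
        routine(fpu, 25.0f);
        if (fpu.overflowed) return call - 1;
    }
    return maxCalls;
}
```

In this model the leaky routine survives eight calls and fails on the ninth load, while the popping version can run indefinitely. In the real routine the later FLD x1 needs a free register too, which is why the overflow arrives even sooner.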

Here is my advice: for such functions, keep your code in C, inline of course. This enables your compiler to parallelize code when you "call" several small functions in sequence. In general, keep asm for loops, such as array processing, where you can unroll. Only reimplement stdlib functions such as those in math.h to speed them up; that way you remove the unnecessary instructions the std lib adds for numerical precision and ANSI compliance, which you generally don't need in 3D coding. What counts is FPS, and many std C/C++ libs include ultimately slow functions like memcpy, atoi, (int)float conversions, malloc, sqrt, etc.
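
As a rough illustration of that advice, a plain inline C++ version might look like the sketch below. Vec3 and Normalize are stand-in names (the class in the thread is Vector3 with members x, y, z), and the 0.0001 threshold is borrowed from the original post. The single reciprocal is written out by hand so it does not depend on compiler float settings.

```cpp
#include <cassert>
#include <cmath>

// Stand-in for the thread's Vector3 class (hypothetical, for illustration).
struct Vec3 {
    float x, y, z;
};

// Normalizes v in place using one square root, one divide and three
// multiplies instead of three divides. Vectors whose squared length is
// at or below the threshold are left untouched.
inline void Normalize(Vec3& v) {
    const float lengthSq = v.x * v.x + v.y * v.y + v.z * v.z;
    if (lengthSq > 0.0001f) {
        const float invLength = 1.0f / std::sqrt(lengthSq);
        v.x *= invLength;
        v.y *= invLength;
        v.z *= invLength;
    }
}
```

Being a small inline function, it leaves the compiler free to schedule it together with the surrounding code, which is the point being made above.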

Use a profiler, or study the disassembly in detail once, and you'll be very surprised by the dirty work the compiler often does.


Yeah, I know you're supposed to multiply by the reciprocal. I just wanted the damn thing to work, I didn't care how.
I can't directly load x onto the stack.

This error is generated when I do this:
FLD x
instead of
FLD x1

quote:

error C2420: 'x' : illegal symbol in first operand



Your code generates the same errors. It might be because I'm not doing this on an Intel machine, lol. I didn't actually think about that until after I coded it and read in the book that this is all for Intel machines...

[edited by - Shadow12345 on December 16, 2003 4:02:53 PM]

quote:
Original post by Charles B
- Your asm code does three fdivs!!! Don't you know how slow these instructions are? Even your C version will be optimized, because the compiler will create a temp variable tmp=1/magnitude and do three muls instead of three divs. So either choose to learn asm and read the docs seriously, or keep coding C and your code will be faster. One has to begin though, and you have an altruistic teacher here. :)



What compilers would take the liberty to do such a transformation?
I guess a/k is in general different from a*(1.f/k) with IEEE754 floating point.
So you would need a switch such as --no-std-fp, no?
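
The difference suspected here is easy to observe. A small sketch, assuming float arithmetic is evaluated in IEEE 754 single precision (the usual case on x86-64 with SSE) and that no fast-math style option is enabled; the helper names are made up for this example.

```cpp
#include <cassert>

// Direct division: a single correctly rounded operation.
inline float DivideDirect(float a, float k) {
    return a / k;
}

// The "optimized" form: one reciprocal, then a multiply. That is two
// roundings instead of one, so the last bit of the result can differ.
inline float DivideViaReciprocal(float a, float k) {
    const float inv = 1.0f / k;
    return a * inv;
}

// Counts the integers a in [1, limit] for which the two forms disagree.
inline int CountMismatches(float k, int limit) {
    int mismatches = 0;
    for (int a = 1; a <= limit; ++a) {
        const float fa = static_cast<float>(a);
        if (DivideDirect(fa, k) != DivideViaReciprocal(fa, k)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

With a = 5 and k = 3, for instance, the two forms land on adjacent floats, while a power-of-two divisor such as 2 never disagrees because its reciprocal is exact. That is why a strictly conforming compiler cannot make this substitution on its own.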


BTW, with GCC's inline asm syntax (inputs/outputs/clobbers), you can often code this kind of function really well: the compiler gets good insight into the side effects, which allows it to schedule your assembly. The syntax is a bit hard to learn, but I just love it (it works like a charm on the PS2 EE & VU0; you can have the compiler manage the SIMD register allocation and get very good general code without too much effort).

e.g.

[cpp]
inline u_long128 AddVector4 (u_long128 a, u_long128 b)
{
    register u_long128 ret;
    asm ("vadd %0, %1, %2" : "=j"(ret) : "j"(a), "j"(b));
    return ret;
}
[/cpp]

If the u_long128 you pass as a parameter is in memory, GCC will generate code that loads it into a "j" (macroVU0) register.
If it is in a GPR, it will handle that too, in another way.
And best of all, if the operands are already in registers, it will just emit the vadd!
(The same goes for outputs.)

Do either of you know why I am getting the error when trying to access the x, y and z components of the vector class? They're private, and I access the x, y, z components from anywhere else in my program using normal C++ code.

[edited by - Shadow12345 on December 16, 2003 9:31:24 PM]

quote:
- Your asm code does three fdivs!!! Don't you know how slow these instructions are? Even your C version will be optimized, because the compiler will create a temp variable tmp=1/magnitude and do three muls instead of three divs. So either choose to learn asm and read the docs seriously, or keep coding C and your code will be faster. One has to begin though, and you have an altruistic teacher here. :)



Multiple pipelines: divides CAN be as fast as multiplies (in practice) if you do it correctly. Don't forget that the Pentium and above (including K7s etc.) do this well. BTW, on the Athlon I use, fully optimised code takes only 5% more time doing divides than doing multiplies when using two pipes.

In a single pipe you are spot on, though. On that same Athlon the divide takes about 20 times (I forget the exact number) as many clock ticks as the multiply.

Oh, and here is something you should learn if you want to make fast functions, especially using assembly:

__declspec(naked) int __fastcall func(int a, int b, int c)
{
    // your stuff: a arrives in ecx, b in edx, c on the stack
    __asm
    {
        xor eax, eax    ; return value 0
        ret 4           ; __fastcall callee pops the stack argument (c)
    }
}


For compatibility's sake, if your compiler has both, use __msfastcall. (If you're on MSVC you only have this one; there it's called __fastcall.)

MSVC-style __fastcall (__msfastcall on the other compilers) passes the first two parameters in registers (ecx and edx); the rest go on the stack.

The other style of __fastcall passes the first three parameters in registers. MSVC doesn't have this version of fastcall; it only has the __msfastcall version.


Using __declspec(naked) tells the compiler not to generate the function's entry and exit code (the prolog and epilog), i.e. the stack frame setup and the closing code. You have to write those yourself. BUT it is the fastest way to call any function.

Thank you all for posting that information, it's very helpful to know these things. However, I still cannot compile the code; I get the error above. The error occurs when I try to fld the x, y or z member of the vector class onto the FPU stack. x, y and z are public members of the class (I don't use private).

quote:
Original post by shadow12345
Thank you all for posting that information, it's very helpful to know these things. However, I still cannot compile the code; I get the error above. The error occurs when I try to fld the x, y or z member of the vector class onto the FPU stack. x, y and z are public members of the class (I don't use private).




This is a limitation of the inline assembly syntax with members :-/

It understands '.' but not '->', so you can't use this->x.
It doesn't understand '(*this).' either.

Easy way:
You can work around it by doing something like this
[cpp]
const Class& rthis = *this;
__asm {
fld [rthis.x];
}
[/cpp]
But you'd better verify that the generated code is not too dumb for your purpose.

Harder way:
Read the offset in assembly and so on...

Better way:
Use GCC 3.2 and its great asm syntax (great for its power, not for all the %%eax noise!)
(But I prefer MS IDEs to vi, emacs and so on.)

Hope this helps

Bertrand

Yeah right, I did not try the code, so I did not explain the fld x stuff in detail. You are right, inline asm can only access local objects by name, that is, data stored on the stack and addressed through esp or ebp. In fact what you need is an integer register holding the this pointer. This depends on your calling convention. In the Visual C++ arcana, __thiscall for C++ puts this in ecx, which is also where __fastcall puts the first argument of the equivalent C function. See the docs.

So the code becomes:

fld dword ptr [ecx]      ; x, assuming x is the first member (offset 0)
...
fld dword ptr [ecx+4]    ; y
...
etc.

With __stdcall or __cdecl you would have to load the register from the stack. Add something like this in the preamble:

mov ecx, this;


As for the fdivs, yes, the compiler can do these optimizations. It is controlled by the floating-point consistency setting. I have already checked that with Visual C++ many times.

Now about three fdiv latencies overlapping in parallel, I am not certain. I am sure it will not be faster on the latest CPUs, and I am sure it will be much slower on older CPUs. Anyway, again, what should be done here is a Newton-Raphson (NR) approximation of RSqrt, which will remain much faster than fsqrt plus fdiv on any processor I have tested for my math lib. You can reach 24-bit precision with 2 or 3 NR iterations.
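
For readers who have not seen it, a Newton-Raphson reciprocal square root can be sketched as follows. The bit-trick seed with the constant 0x5f3759df is the one popularized by Quake III and is an assumption here, since the post does not say which initial guess the author's library uses. RSqrtApprox and RSqrtRelativeError are made-up names, and the sketch assumes a 32-bit IEEE 754 float.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Approximates 1/sqrt(x) for x > 0.
// Seed: integer bit trick (assumed, Quake III style constant).
// Refinement: Newton-Raphson on f(y) = 1/(y*y) - x, which gives
//   y <- y * (1.5 - 0.5 * x * y * y)
inline float RSqrtApprox(float x, int iterations = 2) {
    assert(x > 0.0f);
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // well-defined bit copy
    bits = 0x5f3759dfu - (bits >> 1);
    float y;
    std::memcpy(&y, &bits, sizeof y);
    for (int i = 0; i < iterations; ++i) {
        y = y * (1.5f - 0.5f * x * y * y);
    }
    return y;
}

// Relative error of the approximation, measured against std::sqrt.
inline float RSqrtRelativeError(float x, int iterations) {
    return std::fabs(RSqrtApprox(x, iterations) * std::sqrt(x) - 1.0f);
}
```

Each iteration roughly squares the relative error, so the accuracy improves quickly; whether two or three iterations are enough depends on how good the seed is and how many bits the caller really needs.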


[edited by - Charles B on December 18, 2003 11:59:10 PM]

I think someone already commented on this, but you can't do something like

mov eax, ptr->x

because it would require an extra instruction. In this case, ptr is a memory address that has to be resolved, and after resolution the offset of x must be added to the resolved value:

mov eax, [ptr]
mov eax, [eax + offset]

This is why you should not use pointers unless it's advantageous for some other reason(s). Or, if you have to use pointers but you're going to be working with the 'pointed-to' data for a while, resolve the memory address once and keep using it (i.e. cache it). Granted, the first time you resolve a memory address it will probably be moved into the CPU cache, and further dereferencing will be much faster than if it were kept in system memory. I just think it's good practice.

In C, this is analogous to

void func(someStruct * RemoteStruct)
{
    someStruct LocalStruct;
    LocalStruct = *RemoteStruct;
    // ...do lots of stuff with LocalStruct here.
}

Jorgander, part of what you say is true. But the real answer to Shadow is in my previous post. For vectors somewhere in memory, the best approach is to let the compiler put the pointer in a register before the call by setting the appropriate calling convention.

As for copying the structure locally:

Well, in some cases it's worth doing. But copy standard-type variables rather than whole structs. It can be relevant for register pressure, register allocation, and when you would otherwise have to access multiple pointers, especially in a loop. But there are drawbacks too. Here is a good example:

// Very good practice :
int n1 = pMesh1->numVerts;
int n2 = pMesh2->numVerts;
//...
// Some loop.

Because if n1 and n2 fit into registers, that's good. It's also valuable indirect information for the compiler. For instance, these two statements do not have exactly the same meaning:

if(i < pMesh1->numVerts)
if(i < n1)

This:

- solves pointer aliasing. pMesh1->numVerts could be accessed through pointer aliasing: it could be modified through a g_pNumVerts1 somewhere. In most cases the compiler has no way to guarantee that this won't happen, so it keeps accessing memory every time you use the data. You can set various compiler directives, but some compilation paths will not have enough information. Since n1 is local (on the stack or in a register), the compiler knows no aliasing is possible if none is detected inside the function scope, and can keep the data in a register as long as possible.

- diminishes register pressure. You free two registers, one for pMesh1 and one for pMesh2. These two extra registers can greatly improve tight inner loops.

- makes your syntax more readable. More concise and less cumbersome, it clearly shows which structure variables are actually used.
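
A small sketch of the two loop forms, under the assumption of a hypothetical Mesh struct with an int count and an int index array (the thread only shows pMesh1->numVerts). The output pointer is an int* on purpose, so that a store through it could, as far as the compiler knows, modify numVerts.

```cpp
#include <cassert>

// Hypothetical mesh layout, for illustration only.
struct Mesh {
    int numVerts;
    const int* indices;
};

// Loop bound read through the pointer. Each store to out[i] is an int
// store that might alias pMesh->numVerts, so the compiler generally has
// to reload the bound on every iteration.
inline void OffsetDirect(const Mesh* pMesh, int* out, int offset) {
    for (int i = 0; i < pMesh->numVerts; ++i) {
        out[i] = pMesh->indices[i] + offset;
    }
}

// Loop bound and source pointer copied to locals first. Nothing else can
// point at n or src, so both can stay in registers for the whole loop.
inline void OffsetCached(const Mesh* pMesh, int* out, int offset) {
    const int n = pMesh->numVerts;
    const int* src = pMesh->indices;
    for (int i = 0; i < n; ++i) {
        out[i] = src[i] + offset;
    }
}
```

Both functions produce the same output when nothing aliases. The cached form simply states that fact in a way the compiler can rely on.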

But this practice is nonsense for small vector functions. There is only one pointer. Use __fastcall or __thiscall, and avoid unnecessary preambles, AGIs (address generation interlocks), and store/load dependencies (*). Modern CPUs pay heavily for store/load dependencies. What's the point in adding an unnecessary load (pointed-to data), then a store (local structure), then a read (local structure), with a lot of stalls (*)?


[edited by - Charles B on December 18, 2003 12:31:10 AM]

Guest Anonymous Poster
Hey,
The answer is already written within these pages. I just wanted to point out that the book you are currently using is really darn old. The reality is that it is faster to compute a sqrt with what MS VC .NET provides. If you really want speed, you have to go SIMD and compute the inverse square root, which you can then multiply x, y and z by. Your tackling of asm is honorable, but unless you are doing this to teach yourself this stuff, I would advise against it in code you release.

Guest Anonymous Poster
Sweet god,
you folks need to update your ASM knowledge. With all due respect, this code is SLOW AS MOLASSES and is definitely slower than the compiler-generated code. Look at SIMD for speed or simply stick to C/C++ API functions. Divs are damn slow but sqrt and mul are anything but impressive.

quote:
SLOW AS MOLASSES and is definitely slower than the compiler-generated code

OK, prove it. Give me your C code, or even asm code, that is supposed to be faster than my sample. I'll do an rdtsc timing of a x1024 loop. I'll take the bet. What's good for the Pentium 1 remains in general very efficient on the latest hardware; the reverse is false. Now, if you are so up to date, give me the exact analysis of the code scheduling on various new CPUs and justify in detail what can be improved in the FPU scheduling, since no instruction can be removed. Oh, and needless to say, you don't compare FPU asm code with C SIMD vectorized code. That's another debate. A math lib I have written makes the vectorizers look like antiquities.

quote:
muls are damn slow

Since when is fmul slow compared to fsqrt or fdiv? It already cost only one cycle on the Pentium 1, same as fadd except for a bit more latency. My code runs them in parallel anyway, even without code reordering.

All the remarks I gave are still valid even on the latest hardware. The main one concerned register parameters.

- unnecessary instructions such as fstp followed by fld necessarily cost, and this will always remain a general principle of good coding.
- store/load dependencies matter more than ever.
- AGIs

And this also applies to SIMD code.

quote:
SIMD

Figure this: I don't need to update my knowledge of modern asm at all. I answered a precise question. It was about FPU asm code, and I gave the fastest code for a non-SIMD Intel CPU. Otherwise, of course, anyone should use SIMD code.

I have just spent one month benchmarking things in detail, tweaking compilers and instruction sets. So prove my knowledge is outdated. ROFL. I have created a math lib that is effectively between 100% and 1000% faster than anything else I have tested. And I just need to write C files for any complex math routine, in a portable, multiplatform way.

I have built a whole math lib with intrinsics, asm, 3DNow, SSE, Altivec, etc. I already said I would not code this function in asm. I use macros and intrinsics that use the fast RSqrt; otherwise I use a Carmack-style FPU RSqrt, much faster than sqrtf. I only code array processing routines in native SIMD, and only if I need them. All my basic math routines (dot, cross, etc.) run in 2-10 cycles on 3DNow, and fewer on SSE. Needless to say, inline functions that don't inline in std C++ won't reach such performance. Vectorizers won't either.

quote:
C/C++ API

ROFL. The realm of 1000-cycle functions, even when the core algorithm is counter++. When I start counting cycles in some intensive code, I prefer to know what kind of code the compiler generates. Even a "noob" like Carmack has been surprised many times by the std C/C++ APIs.

Now if you want to compete on any math code, please submit a challenge. And we'll see how up to date you are compared to me.
With all due respect ...

quote:
Original post by Tramboi

What compilers would take the liberty to do such a transformation?
I guess a/k is in general different from a*(1.f/k) with IEEE754 floating point.
So you would need a switch such as --no-std-fp, no?



GCC certainly supports this and has for a number of years. Look at the options turned on by -ffast-math, which include various ways the compiler can improve math performance by relaxing compliance with IEEE/ISO standards. Because of the standards compliance issues, -O2 and -O3 don't enable -ffast-math, but in games such standards are unimportant, and -ffast-math (or its equivalent) is usually used with -O2/-O3 and platform/processor-specific flags.

[edited by - johnb on December 19, 2003 4:47:07 PM]

quote:

GCC certainly supports this and has for a number of years. Look at the options turned on by -ffast-math, which include various ways the compiler can improve math performance by relaxing compliance with IEEE/ISO standards. Because of the standards compliance issues, -O2 and -O3 don't enable -ffast-math, but in games such standards are unimportant, and -ffast-math (or its equivalent) is usually used with -O2/-O3 and platform/processor-specific flags.



Oh, I knew this for ee-gcc but I thought it was PS2-specific.
I can be dumb sometimes!
I didn't even know about the VC "FP consistency" option!
It would have saved me a bit of time optimizing obvious stuff!

