problem using intrinsic sqrt with FPU programming

I'm just fiddling around with FPU programming. I've tried this code with and without the 'generate intrinsic functions' setting in the Visual C++ 6 compiler. The code does not work properly for some reason, and I'm not sure why. This is my first day of FPU stuff. I'm using Tricks of the 3D Game Programming Gurus (LaMothe), pages 456+, as a resource.
void Vector3::Normalize()
{
#if 1
    float basic = x*x + y*y + z*z;

    if(basic>.0001)
    {
        _asm
        {
            FLD   basic;  //Put basic length into st(0)
            FSQRT;        //Put SQUARE ROOT of basic into st(0)
            FST   basic;  //Store square root of basic into memory address of basic
        }
        //  if(basic) //won't ever be exactly zero
        //  {

        //Can't seem to access this->x,y,or z
        float x1 = x;
        float y1 = y;
        float z1 = z;
        _asm
        {
            FLD   x1;     //Load into FPU stack element zero
            FDIV  basic;  //divide by length
            FSTP  x1;     //store in x1 and pop stack

            FLD   y1;
            FDIV  basic;
            FSTP  y1;

            FLD   z1;
            FDIV  basic;
            FSTP  z1;
        }
        x = x1;
        y = y1;
        z = z1;
        //}
    }
#else
    float magnitude = sqrt( (x * x) + (y * y) + (z * z) );
    if(magnitude)
    {
        x /= magnitude;
        y /= magnitude;
        z /= magnitude;
    }
#endif
}
EDIT: Okay, I really don't know what I'm doing. I changed this line of code:
FST basic; //Store square root of basic into memory address of basic
to this:
FSTP basic; //Store square root of basic into memory address of basic
The only difference is the P, which denotes that it pops the stack. I'm not really sure why that is significant.
[edited by - Shadow12345 on December 15, 2003 3:59:26 PM] [edited by - Shadow12345 on December 15, 2003 4:02:16 PM] [edited by - Shadow12345 on December 15, 2003 4:02:44 PM]

Share on other sites
You gave the answer yourself: fstp pops the stack, fst does not. If you don't pop, your function returns with the FPU stack non-empty. The x87 stack has only eight registers, so execute your routine at most 7 more times (in fact fewer here) and you are sure to generate a stack overflow error. Besides that, there is pointless, inefficient and even dangerous stuff in your code.

- x1=x; y1=y; z1=z; is pointless: x, y, z are already in memory, so there is no need to copy them onto the (esp) stack.

- Intermediate fstp should be absolutely avoided; if you code in asm, just keep variables in registers (st(x)). First, fstp adds latency and disrupts instruction parallelism. Next, it's a clear immediate store-then-load dependency, which is very damaging. And last but not least, it's totally unnecessary.

- Your asm code does three fdivs! Don't you know how slow these instructions are? Even your C version will be optimized, because the compiler will create a temp variable tmp=1/magnitude and do three muls instead of three divs. So either choose to learn asm and read the docs seriously, or keep coding in C; your code will be faster. One has to begin somewhere though, and you have an altruistic teacher here.

The code should simply be this; I bet it's at least 3 times faster:
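__asm{
fld x               ; x
fmul st(0), st(0)   ; x*x
fld y               ; y, x*x
fmul st(0), st(0)   ; y*y, x*x
faddp st(1), st     ; x*x+y*y
fld z               ; z, x*x+y*y
fmul st(0), st(0)   ; z*z, x*x+y*y
faddp st(1), st     ; len^2
fsqrt               ; len
fld1                ; 1, len
fdivrp st(1), st    ; k = 1/len
fld x               ; x, k
fmul st, st(1)      ; x*k, k
fld y               ; y, x*k, k
fmul st, st(2)      ; y*k, x*k, k
fxch st(1)          ; x*k, y*k, k
fld z               ; z, x*k, y*k, k
fxch st(3)          ; k, x*k, y*k, z
fmulp st(3), st     ; x*k, y*k, z*k
fstp x
fstp y
fstp z              ; stack is empty again
}
//Done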

Try to see what's on the stack at each step. Your function must end up with an empty stack since you return no float. That's the problem you had. This will be your homework. I don't need it; I wrote this code in 30 seconds (I am trained in asm, of any kind). Anyway, I would never use such code myself. Try to google quick reciprocal square root (rsqrt) approximations, because fsqrt followed by fdiv is a damn slow sequence. Moreover, your function is too small to parallelize instructions. So don't use asm here.
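
To make the stack bookkeeping concrete, here is a minimal sketch in plain C++ that tracks only the depth of the eight-slot x87 register stack while replaying the original routine. X87Model, normalize_call and calls_before_overflow are hypothetical names invented for this illustration, and the model ignores what the real FPU does with values on overflow; it only flags that a push had no free slot.

```cpp
// Toy model of the x87 register stack: only the depth is tracked,
// not the values. The real FPU has 8 slots, st(0)..st(7).
struct X87Model {
    static constexpr int kSlots = 8;
    int depth = 0;
    bool overflowed = false;

    void fld()  { if (depth == kSlots) overflowed = true; else ++depth; }  // push
    void fst()  {}                             // store st(0), keep it on the stack
    void fstp() { if (depth > 0) --depth; }    // store st(0), then pop
};

// One call of the original Normalize(); use_fstp selects the fixed version.
inline void normalize_call(X87Model& fpu, bool use_fstp) {
    fpu.fld();                                 // FLD basic (FSQRT works in place)
    if (use_fstp) fpu.fstp(); else fpu.fst();  // FSTP basic vs. FST basic
    for (int i = 0; i < 3; ++i) {              // x1, y1, z1
        fpu.fld();                             // FLD component (FDIV works in place)
        fpu.fstp();                            // FSTP component
    }
}

// Number of calls that complete before the model overflows (capped at max_calls).
inline int calls_before_overflow(bool use_fstp, int max_calls = 100) {
    X87Model fpu;
    for (int n = 0; n < max_calls; ++n) {
        normalize_call(fpu, use_fstp);
        if (fpu.overflowed) return n;
    }
    return max_calls;
}
```

With FST every call leaves one value behind, so in this model calls_before_overflow(false) reports that seven calls complete and the eighth overflows, while the FSTP version runs until the cap.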

Here is my advice: for such functions, keep your code in C, inline of course. This enables your compiler to parallelize code when you "call" several small functions in sequence. In general, keep asm for loops, such as array processing, where you can unroll. Only reimplement some stdlib functions, like those in math.h, to speed them up. That way you remove the unnecessary instructions the std lib adds for numerical precision and ANSI compliance, which in general you don't need in 3D coding. What counts is FPS, and many std C/C++ libs include extremely slow functions like memcpy, atoi, (int)float conversions, malloc, sqrt, etc.

Use a profiler or read through a disassembly in detail once, and you'll be very surprised by the dirty work the compiler often does.
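
A minimal sketch of the "keep it in C, inline" advice, assuming a plain struct of three floats: one square root, one divide, then three multiplies by the reciprocal. Vec3 and normalize are hypothetical names for this illustration, not the thread's actual Vector3 class.

```cpp
#include <cmath>

// Hypothetical stand-in for the thread's Vector3 class.
struct Vec3 {
    float x, y, z;
};

// Normalizes v in place. Returns false and leaves v untouched if the
// squared length is at or below the threshold used in the original post.
inline bool normalize(Vec3& v) {
    const float len2 = v.x * v.x + v.y * v.y + v.z * v.z;
    if (len2 <= 0.0001f) return false;
    const float k = 1.0f / std::sqrt(len2);  // one sqrt, one divide
    v.x *= k;                                // three multiplies instead of
    v.y *= k;                                // three divides
    v.z *= k;
    return true;
}
```

Being a small inline function, this leaves scheduling and register allocation to the compiler, which is the point being made above.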

Share on other sites
Yeah, I know you're supposed to multiply by the reciprocal; I just wanted the damn thing to work, I didn't care how.
I can't directly load x onto the stack.

this error is generated when I do this:
FLD x
FLD x1

quote:

error C2420: 'x' : illegal symbol in first operand

Your code generates the same errors. It might be because I'm not doing this on an Intel machine, lol. I didn't actually think about that until after I coded it and read in the book that this is all for Intel machines...

[edited by - Shadow12345 on December 16, 2003 4:02:53 PM]

[edited by - Shadow12345 on December 16, 2003 4:03:52 PM]

[edited by - Shadow12345 on December 16, 2003 4:04:54 PM]

Share on other sites
quote:
Original post by Charles B
- Your asm code does three fdivs! Don't you know how slow these instructions are? Even your C version will be optimized, because the compiler will create a temp variable tmp=1/magnitude and do three muls instead of three divs. So either choose to learn asm and read the docs seriously, or keep coding in C; your code will be faster. One has to begin somewhere though, and you have an altruistic teacher here.

Which compilers would take the liberty of doing such a transformation?
I guess a/k is in general different from a*(1.f/k) with IEEE 754 floating point.
So you would need a switch such as --no-std-fp, no?
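
The concern can be checked directly. The sketch below counts how many small integer pairs (a, k) give different float results for a / k and a * (1.0f / k); count_mismatches is a hypothetical helper written for this illustration, and the volatile stores are there only to force each intermediate to be rounded to float.

```cpp
// Counts pairs (a, k) with 1 <= a, k <= limit for which a / k and
// a * (1.0f / k) differ when every step is rounded to float.
inline int count_mismatches(int limit) {
    int mismatches = 0;
    for (int ai = 1; ai <= limit; ++ai) {
        for (int ki = 1; ki <= limit; ++ki) {
            volatile float a = static_cast<float>(ai);
            volatile float k = static_cast<float>(ki);
            volatile float quotient = a / k;          // one rounding
            volatile float reciprocal = 1.0f / k;     // first rounding
            volatile float product = a * reciprocal;  // second rounding
            if (quotient != product) ++mismatches;
        }
    }
    return mismatches;
}
```

Division rounds once while the reciprocal route rounds twice, so a nonzero count is expected, which is why a compiler following strict IEEE 754 semantics should not make this substitution without being told it may.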

BTW, with GCC inline asm syntax with inputs/outputs/clobbers, you can often code this kind of function in a really good way, with the compiler having good insight into the side effects, allowing it to schedule your assembly. The syntax is a bit hard to learn, but I just love it (it works like a charm on PS2 EE & VU0; you can have the compiler manage the SIMD register allocation and get very good general code without too much effort).

For example:

[cpp]
inline u_long128 AddVector4 (u_long128 a, u_long128 b)
{
    register u_long128 ret;
    asm ("vadd %0, %1, %2" : "=j"(ret) : "j"(a), "j"(b));
    return ret;
}
[/cpp]

If the u_long128 you pass as a parameter is in memory, GCC will generate code that will load it in a "j" (macroVU0) register.
If it is in a GPR, it will do it too in another way.
And best, if those are already in registers, it will just call vadd!
(The same for outputs)

Share on other sites
Do either of you know why I am getting the error when trying to access the x, y and z components in the vector class? They're private, and I access the x, y, z components from anywhere else in my program using normal C++ code.

[edited by - Shadow12345 on December 16, 2003 9:31:24 PM]

Share on other sites
quote:
- Your asm code does three fdivs! Don't you know how slow these instructions are? Even your C version will be optimized, because the compiler will create a temp variable tmp=1/magnitude and do three muls instead of three divs. So either choose to learn asm and read the docs seriously, or keep coding in C; your code will be faster. One has to begin somewhere though, and you have an altruistic teacher here.

Multiple pipelines: divides CAN be as fast as multiplies (in practice) if you do it correctly. Don't forget that the Pentium and above (including K7s etc.) do this well. BTW, the Athlon I use, when fully optimised, takes only 5% more time doing divides than multiplies when using two pipes.

In a single pipe you are spot on though. There, the same Athlon takes about 20 times (I forgot the exact number) as many clock ticks for the divide as for the multiply.

Share on other sites
Oh, and here is something that you should learn if you want to make fast functions, especially using assembly:

__declspec(naked) int __fastcall func(int a,int b,int c)
{
    __asm
    {
        xor eax,eax  ; return 0 in eax
        ret 4        ; a is in ecx, b in edx, c is on the stack and the callee pops it
    }
}

For compatibility's sake, if you have both, use __msfastcall (if you're on MSVC you only have this one, and it's called __fastcall).

MSVC-style __fastcall (__msfastcall for the other compilers) passes the first two parameters in ECX and EDX.

The Borland-style __fastcall (__fastcall) passes the first three parameters in EAX, EDX and ECX. MSVC doesn't have this version of the fastcall convention; it only has the __msfastcall version.

Using __declspec(naked) tells the compiler not to generate the function's entry and exit code (the prologue and epilogue). That means the stack frame setup, the code the function starts with and the closing code; you have to do it yourself. BUT it is the fastest way to call a function.

Share on other sites
Thank you all for posting that information; it's very helpful to know these things. However, I still cannot compile the code; I get the error above. The error occurs when I try to fld the x, y or z member of the vector class into the register. x, y and z are public members of the class (I don't use private).

Share on other sites
quote:
Thank you all for posting that information; it's very helpful to know these things. However, I still cannot compile the code; I get the error above. The error occurs when I try to fld the x, y or z member of the vector class into the register. x, y and z are public members of the class (I don't use private).

This is a limitation of the inline assembly syntax with members :-/

It understands '.' but not '->', so you can't use this->x.
It doesn't understand '(*this).' either.

Easy way:
You can work around it by doing something like this
[cpp]
const Class& rthis = *this;
__asm {
    fld [rthis.x];
}
[/cpp]
But you'd better verify the generated code is not too dumb for your purpose.

Harder way:
Read the offset in assembly and so on...

Better way:
Use GCC 3.2 and its great asm syntax (great in power, not great for the %%eax shit!)
(But I prefer MS IDEs to vi, emacs and so on.)

Hope this helps

Bertrand

Share on other sites
Yeah, right: I did not try the code, so I did not explain the fld x stuff in detail. You are right, you can only access local objects by name with inline asm, that is, data stored on the stack and addressed through esp or ebp. In fact what you need is an integer register holding the this pointer. This depends on your calling convention. If I remember the arcana of Visual C++ well, __thiscall for C++ puts this in ecx, the same register the equivalent __fastcall C function uses for its first argument. See the docs.

So the code becomes :

fld dword ptr [ecx]      ; x, assuming x, y, z are floats at the start of the object
...
fld dword ptr [ecx+4]    ; y
...
etc..

With __stdcall or __cdecl you would have to load the register from the stack; add something like this as a preamble:

mov ecx, this;
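
The [ecx], [ecx+4] addressing above only works if the members really sit at those byte offsets. A minimal sketch in plain C++ that checks this at compile time, assuming a class with three float members and no virtual functions; Vector3 here is a hypothetical stand-in for the poster's class, and load_at is a helper invented for the illustration.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the poster's vector class: three floats, no virtuals.
struct Vector3 {
    float x, y, z;
};

// Byte offsets that [this], [this+4], [this+8] rely on.
static_assert(offsetof(Vector3, x) == 0, "x expected at offset 0");
static_assert(offsetof(Vector3, y) == sizeof(float), "y expected right after x");
static_assert(offsetof(Vector3, z) == 2 * sizeof(float), "z expected right after y");

// Reads a float through a raw byte offset from the object's address,
// which is what the asm does with a base register plus displacement.
inline float load_at(const Vector3& v, std::size_t byte_offset) {
    float out;
    std::memcpy(&out, reinterpret_cast<const unsigned char*>(&v) + byte_offset,
                sizeof out);
    return out;
}
```

If the class later gains a virtual function or another leading member, the static_asserts fail at compile time instead of the asm silently reading the wrong bytes.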

As for the fdivs: yes, the compiler can do these optimizations. It's governed by the floating-point compiler setting, floating point consistency. I have already checked that with Visual C++ many times.

Now, about 3 fdiv latencies overlapping in parallel, I am not certain. I am sure it will not be faster on the latest CPUs, and I am sure it will be much slower on older CPUs. Anyway, again, what should be done here is a Newton-Raphson (NR) approximation of rsqrt, which remains much faster than fsqrt plus fdiv on every processor I have tested for my math lib. You can reach 24-bit precision with 2 or 3 NR iterations.

[edited by - Charles B on December 18, 2003 11:59:10 PM]