problem using intrinsic sqrt with FPU programming

15 comments, last by shadow12345 20 years, 4 months ago
i think someone already commented on this, but you can't do something like

mov eax, ptr->x

because it would require an extra instruction. in this case, ptr is a memory address that has to be resolved, and after resolution, the offset of x must be added to the resolved value:

mov eax, [ptr]
mov eax, [eax + offset]

this is why you should not use pointers unless it's advantageous for some other reason(s). or if you have to use pointers but you're going to be working with the 'pointed-to' data for a while, resolve the memory address once and keep using it (i.e. cache it). granted, the first time you resolve a memory address it will probably be moved into CPU cache, and further dereferencing will be much faster than if it were kept in system memory; i just think it's good practice.

in C, this is analogous to

void func(someStruct * RemoteStruct)
{
    someStruct LocalStruct;
    LocalStruct = *RemoteStruct;
    // ...do lots of stuff with LocalStruct here.
}
Jorgander, part of what you say is true. But the real answer to Shadow is in my previous post. In the case of vectors somewhere in memory, the best approach is to let the compiler store the pointer in a register before the call by setting the appropriate calling conventions.

As for copying the structure locally:

Well, there are cases where it's worth doing. But copy standard-type variables rather than whole structs. It can be relevant for register pressure, register allocation, and when you would otherwise have to access multiple pointers, especially in a loop. But there are drawbacks too. Here's a good example:

// Very good practice :
int n1 = pMesh1->numVerts;
int n2 = pMesh2->numVerts;
//...
// Some loop.

Because if n1, n2 can fit into registers, that's good. It's also valuable indirect information for the compiler. For instance these two instructions do not have exactly the same meaning:

if(i < pMesh1->numVerts)
if(i < n1)
This :

solves pointer aliasing
pMesh1->numVerts could be accessed through pointer aliasing: it could be modified through some g_pNumVerts1 pointer somewhere. The compiler has no way to guarantee that this won't happen in most cases, so it keeps accessing memory every time you use the data. You can set the various compiler directives, but some compilation paths will not have enough information. Since n1 is local (on the stack or in a register), the compiler knows no aliasing is possible if none is detected inside the function scope, and so it can keep the data in a register as long as possible.

diminishes register pressure.
You free two registers, one for pMesh1 and one for pMesh2. These two more registers can greatly improve tight inner loops.

makes your syntax more readable
More concise, less cumbersome, this clearly shows which are the actual structure variables used.
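A minimal C++ sketch of the hoisting pattern described above (the Mesh type and its fields here are hypothetical, just for illustration):

```cpp
#include <cassert>

// Hypothetical mesh type, for illustration only.
struct Mesh {
    int numVerts;
    float verts[16][3];
};

// Hoisting numVerts into locals means the compiler need not reload it on
// every iteration in case a write through another pointer changed it.
float SumX(const Mesh* pMesh1, const Mesh* pMesh2)
{
    int n1 = pMesh1->numVerts;   // local: no aliasing possible, can stay in a register
    int n2 = pMesh2->numVerts;
    float sum = 0.0f;
    for (int i = 0; i < n1; ++i) sum += pMesh1->verts[i][0];
    for (int i = 0; i < n2; ++i) sum += pMesh2->verts[i][0];
    return sum;
}
```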

But this practice is nonsense for small vector functions. There is only one pointer. Use __fastcall or __thiscall, and avoid unnecessary preambles, AGIs, and store-to-load dependencies. Modern CPUs have huge store-to-load penalties. What's the point in adding an unnecessary load (pointed data), then a store (local structure), then a read (local structure), with a lot of stalls?


[edited by - Charles B on December 18, 2003 12:31:10 AM]
"Coding math tricks in asm is more fun than Java"
Hey,
The answer is already written within these pages. I just wanted to point out that the book you are currently using is really darn old. The reality is that it is faster to compute a sqrt with what MS VC .NET provides. If you really want speed, you have to go SIMD and compute the inverse square root, which you can multiply by x. Your tackling of asm is honorable, but unless you are doing this to teach this stuff to yourself, I would advise against it in code you release.
Sweet god,
you folks need to update your ASM knowledge. With all due respect, this code is SLOW AS MOLASSES and is definitely slower than the compiler-generated code. Look at SIMD for speed or simply stick to C/C++ API functions. Divs are damn slow, and sqrt and mul are anything but impressive.

quote:Original post by Charles B
You gave the answer yourself. Fstp pops the stack, fst does not. If you don't do it, your function will end up with the FPU stack non-empty. Execute your routine at most 7 more times (in fact fewer here) and you are sure to generate a stack overflow error. Next, there is pointless, inefficient and even dangerous stuff in your code.

- x1=x; y1=y; z1=z; is pointless: x, y, z are already in memory, so there is no need to copy them onto the (esp) stack.

- intermediate fstp should be absolutely avoided; if you code in asm, just keep variables in registers (st(x)). First, fstp adds latency and disrupts instruction parallelism. Next, it's a clear immediate store-to-load dependency, very damaging. And last but not least, it's totally unnecessary.

- your asm code does three fdivs! Don't you know how slow these instructions are? Even your C version will be optimized, because the compiler will create a temp variable tmp = 1/magnitude and will do three muls instead of three divs. So either choose to learn asm and read the docs seriously, or else keep coding C; your code will be faster. One has to begin somewhere though, and you have an altruistic teacher here. :)
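A plain C++ sketch of the divide-to-reciprocal transform just described (function names are made up for illustration):

```cpp
#include <cassert>
#include <cmath>

// Naive version: three divides. fdiv is among the slowest FPU instructions.
void NormalizeSlow(float& x, float& y, float& z)
{
    float mag = std::sqrt(x * x + y * y + z * z);
    x /= mag;
    y /= mag;
    z /= mag;
}

// Transformed version: one divide, three multiplies.
void NormalizeFast(float& x, float& y, float& z)
{
    float mag = std::sqrt(x * x + y * y + z * z);
    float k = 1.0f / mag;   // compute the reciprocal once
    x *= k;
    y *= k;
    z *= k;
}
```

Note the two versions can differ in the last bit of the result, which is why strict-IEEE compilers won't always do this rewrite for you.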

The code should simply be this one; I bet it's at least 3 times faster:
__asm{
fld x            ; x
fmul st, st(0)   ; x*x
fld y            ; y, x*x
fmul st, st(0)   ; y*y, x*x
fld z            ; z, y*y, x*x
fmul st, st(0)   ; z*z, y*y, x*x
fxch st(2)       ; x*x, y*y, z*z
faddp st(1), st  ; x*x+y*y, z*z
faddp st(1), st  ; x*x+y*y+z*z
fsqrt            ; mag
fld1             ; 1, mag
fdivrp st(1), st ; k = 1/mag
fld x            ; x, k
fmul st, st(1)   ; x*k, k
fld y            ; y, x*k, k
fmul st, st(2)   ; y*k, x*k, k
fxch st(1)       ; x*k, y*k, k
fld z            ; z, x*k, y*k, k
fxch st(3)       ; k, x*k, y*k, z
fmulp st(3), st  ; x*k, y*k, z*k
fstp x           ; store x*k
fstp y           ; store y*k
fstp z           ; store z*k, stack empty
}
//Done

Try to see what's on the stack at each step. Your function must end up with an empty stack since you return no float. That's the problem you had. This will be your homework; I don't need it, I wrote this code in 30 seconds (I am trained in asm, any kind of it). Anyway, I would never use such code myself. Try to google quick reciprocal sqrt (rsqrt) approximations, because fsqrt followed by fdiv is a damn slow sequence. Moreover, your function is too small to parallelize instructions. So don't use asm here.
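For reference, a sketch of the kind of quick reciprocal sqrt approximation being alluded to: the well-known Quake III-style integer bit trick (magic constant 0x5f3759df) with one Newton-Raphson refinement step. Accuracy is only about 0.2% worst case, which is often fine for normalizing vectors:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Fast approximate 1/sqrt(x): an integer bit manipulation gives a first
// guess, then one Newton-Raphson iteration refines it.
float FastRSqrt(float x)
{
    float half = 0.5f * x;
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);     // reinterpret the float's bit pattern
    i = 0x5f3759df - (i >> 1);         // magic initial guess
    float y;
    std::memcpy(&y, &i, sizeof y);
    y = y * (1.5f - half * y * y);     // one Newton-Raphson step
    return y;
}
```

Using memcpy for the bit reinterpretation avoids the undefined behavior of the original pointer-cast version.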

Here is my advice: for such functions, keep your code in C, inlined of course. This enables your compiler to parallelize code when you "call" several small functions in sequence. In general, keep asm for loops, such as array processing, where you can unroll. Only reimplement some stdlib (e.g. math.h) functions to speed them up. Thus you will remove the unnecessary instructions the std lib adds for numerical precision and ANSI compliance; in general you don't need those in 3D coding. What counts is FPS, and many std C/C++ libs include ultimately slow functions like memcpy, atoi, (int)float conversions, malloc, sqrt, etc.

Use a profiler or read the disassembly in detail once, and you'll be very surprised by the dirty work the compiler often does.




SLOW AS MOLASSES and is definitely slower than the compiler-generated code

OK, prove it. Give me your C code, or even asm code, supposed to be faster than my sample. I'll do an rdtsc of a x1024 loop. I'll take the bets. What's good for the Pentium 1 remains in general very efficient on any recent hardware; the reverse is false. Now, if you are so up to date, give me the exact analysis of the code scheduling on various new CPUs and justify in detail what can be improved in the FPU scheduling, since no instruction can be removed. And no need to say it: you don't compare FPU asm code with C SIMD vectorized code, that's another debate. A math lib I have written makes the vectorizers look like antiquities.
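A minimal sketch of the rdtsc x1024 timing loop being proposed, assuming a compiler that provides the __rdtsc intrinsic (GCC/Clang via x86intrin.h, MSVC via intrin.h). Note rdtsc is not a serializing instruction, so treat the counts as rough:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// volatile sink keeps the compiler from optimizing the loop away
volatile float g_sink;

// Read the time-stamp counter before and after 1024 sqrt calls.
std::uint64_t CyclesFor1024Sqrts()
{
    std::uint64_t start = __rdtsc();
    for (int i = 0; i < 1024; ++i)
        g_sink = std::sqrt(static_cast<float>(i));
    return __rdtsc() - start;
}
```

In practice you would run the loop several times and keep the minimum, to filter out interrupts and cache warm-up.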

muls are damn slow
Since when is fmul slow compared to fsqrt or fdiv? It already cost only one cycle on the Pentium 1, same as fadd except with a bit more latency. My code makes the muls run in parallel anyway, even without code reordering.

All the remarks I gave are still valid even on the latest hardware. The main one concerned register parameters.

- unnecessary instructions such as fstp fld necessarily cost, and this will always remain a general principle of good coding.
- store/load dependencies are more vivid than ever.
- AGIs

And this also applies to SIMD code.

SIMD

Figure this: I don't need to update my knowledge at all concerning modern asm. I answered a precise question about FPU asm code, and I gave the fastest code for a non-SIMD Intel; otherwise, of course, anyone should use SIMD code.

I have just spent one month benchmarking things in detail, tweaking the compilers and instruction sets. So prove my knowledge is outdated. ROFL. I have created a math lib that is effectively between 100% and 1000% faster than anything else I have tested. And I just need to write C files for any complex math routine, in a portable, multiplatform way.

I have built a whole math lib with intrinsics, asm, 3DNow, SSE, Altivec, etc. I already said I would not code this function in asm. I use macros and intrinsics that use the fast RSqrt; otherwise I use a Carmack-style FPU RSqrt, much faster than sqrtf. I only code array-processing routines in native SIMD, and only if I need them. All my basic math routines (dot, cross, etc.) work in 2-10 cycles on 3DNow, and fewer on SSE. No need to tell you that inline functions that don't inline in std C++ won't reach such perfs. Vectorizers won't reach such perfs either.
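As an illustration of the SSE rsqrt approach mentioned above, a sketch using the _mm_rsqrt_ss intrinsic (x86 only; the wrapper name is made up). The hardware estimate gives about 12 bits of precision; one Newton-Raphson step refines it to about 22 bits, still far cheaper than sqrtf plus a divide:

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h>   // SSE intrinsics

// Approximate 1/sqrt(x) via the SSE hardware estimate, refined once.
float RSqrtSSE(float x)
{
    float r = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    return r * (1.5f - 0.5f * x * r * r);   // Newton-Raphson refinement
}
```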

C/C++ API
ROFL. The realm of 1000-cycle functions even when the core algo is counter++. When I start to count in cycles in some intensive code, I prefer to know what kind of code the compiler generates. Even a "noob" like Carmack has been surprised many times by the std C/C++ APIs.

Now if you want to compete with any math code, please submit a challenge. And we'll see how up to date you are compared to me.
With all due respect ...
"Coding math tricks in asm is more fun than Java"
quote:Original post by Tramboi
What compilers would take the liberty to do such a transformation?
I guess a/k is in general different from a*(1.f/k) with IEEE754 floating point.
So you would need a switch such as --no-std-fp, no?


GCC certainly supports this and has for a number of years. Look at the options turned on by -ffast-math, which include various ways the compiler can improve math performance by relaxing compliance with the IEEE/ISO standards. Because of the standards-compliance issues -O2 and -O3 don't enable -ffast-math, but in games such standards are unimportant, and -ffast-math (or its equivalent) is usually used with -O2/-O3 and platform/processor-specific flags.
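A tiny example of the a/k versus a*(1.f/k) distinction under discussion; the compile lines in the comments are assumed flags for illustration, not verified output:

```cpp
#include <cassert>
#include <cmath>

// Under strict IEEE 754 semantics, a / k and a * (1.0f / k) may round
// differently, so the compiler cannot substitute one for the other
// unless a flag such as -ffast-math (GCC) or /fp:fast (MSVC) relaxes
// the rules. Assumed build lines:
//   g++ -O2 file.cpp              (strict: keeps the divide)
//   g++ -O2 -ffast-math file.cpp  (may rewrite a/k as a * (1.0f/k))
float DivDirect(float a, float k)   { return a / k; }
float DivViaRecip(float a, float k) { return a * (1.0f / k); }
```

The two functions agree to within a unit or so in the last place, which is exactly the slack -ffast-math trades away for speed.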

[edited by - johnb on December 19, 2003 4:47:07 PM]
John Blackburne, Programmer, The Pitbull Syndicate
quote:
GCC certainly supports this and has for a number of years. Look at the options turned on by -ffast-math, which include various ways the compiler can improve math performance by relaxing compliance with the IEEE/ISO standards. Because of the standards-compliance issues -O2 and -O3 don't enable -ffast-math, but in games such standards are unimportant, and -ffast-math (or its equivalent) is usually used with -O2/-O3 and platform/processor-specific flags.


Oh, I knew this for ee-gcc, but I thought it was PS2-specific.
I can be dumb sometimes!
I didn't even know about VC's "FP consistency" option!
It would have saved me a bit of time optimizing obvious stuff!


This topic is closed to new replies.
