Math: use FPU or 3DNow!?

Started by
5 comments, last by Charles B 19 years, 3 months ago
Hello, I have a water demo that is very math-heavy: the grid is a projected grid, so each vertex must be transformed twice in software; I also project a texture onto it, and because I perturb the surface I have to recalculate the surface normals every time I update it. It is a real bottleneck. I have tried rewriting my routines in 3DNow! code, but I have never managed to get them to run much faster than plain x87 FPU code. Why is that? Even the code in AMD's 3D library does not turn out to be faster than FPU code. Has anybody managed to really take advantage of these instruction sets, and how? I have an AMD Athlon and I code using Athlon-specific instructions.
x87 code, that's like the missing link between 32-bit and 64-bit? ....
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~I'm looking for work
The strength of SIMD is just that: Single Instruction, Multiple Data, so one instruction can do up to 4 calculations in parallel. You need to design your code to take advantage of this, for example by transforming a vector by a matrix so that each instruction multiplies an entire row (or column) of the matrix at once.
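To make that concrete, here is a minimal sketch of a 4x4 matrix times vector transform, where each multiply handles four components at once. It uses SSE intrinsics rather than 3DNow!, because those are what current compilers still ship; the idea is the same. The function name `transform_point` and the column-major layout are assumptions for illustration, not something from the original poster's code.

```c
#include <assert.h>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Hypothetical helper: m is a 4x4 matrix stored column-major
 * (m[0..3] is column 0, m[4..7] is column 1, and so on).
 * Computes out = M * v as x*col0 + y*col1 + z*col2 + w*col3. */
void transform_point(const float m[16], const float v[4], float out[4])
{
    assert(m && v && out);
#if HAVE_SSE
    /* Each _mm_mul_ps scales a whole column (4 floats) in one instruction. */
    __m128 r = _mm_mul_ps(_mm_loadu_ps(m + 0), _mm_set1_ps(v[0]));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 4),  _mm_set1_ps(v[1])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 8),  _mm_set1_ps(v[2])));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(m + 12), _mm_set1_ps(v[3])));
    _mm_storeu_ps(out, r);
#else
    /* Scalar fallback with the same arithmetic, one component at a time. */
    for (int i = 0; i < 4; ++i)
        out[i] = m[i] * v[0] + m[4 + i] * v[1] + m[8 + i] * v[2] + m[12 + i] * v[3];
#endif
}
```

The unaligned loads keep the sketch safe for any pointer; with 16-byte aligned data the aligned variants (`_mm_load_ps`, `_mm_store_ps`) would be the usual choice.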
Quote:Original post by JOL
x87 code, that's like the missing link between 32-bit and 64-bit? ....

The 8087 was the original Intel FPU; "x87" refers to the floating-point instruction set descended from it.
Are you sure that you are computing normals in the right way for your grid?
(that is, you shouldn't be averaging triangle normals, you should be computing the normals directly from your grid)
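As a rough sketch of what "directly from the grid" could look like, assuming the surface is a heightfield z(x, y) on a regular grid with constant spacing: the normal at a vertex is proportional to (-dz/dx, -dz/dy, 1), and the two slopes come from differences of neighbouring heights. The function name `grid_normals`, the row-major layout and the clamped borders are assumptions for illustration. The output is left unnormalized here; normalization is a separate step.

```c
#include <assert.h>

/* Hypothetical helper: h holds w*rows heights, row-major (h[j*w + i]).
 * dx, dy are the constant grid steps. nrm receives 3 floats per vertex:
 * the unnormalized normal (-dz/dx, -dz/dy, 1). */
void grid_normals(const float *h, int w, int rows, float dx, float dy, float *nrm)
{
    assert(w >= 2 && rows >= 2 && dx > 0.0f && dy > 0.0f);
    for (int j = 0; j < rows; ++j) {
        /* Clamp neighbour indices at the borders (one-sided difference there). */
        int j0 = j > 0 ? j - 1 : j;
        int j1 = j < rows - 1 ? j + 1 : j;
        for (int i = 0; i < w; ++i) {
            int i0 = i > 0 ? i - 1 : i;
            int i1 = i < w - 1 ? i + 1 : i;
            /* Slopes from neighbouring heights; no per-triangle normals needed. */
            float dzdx = (h[j * w + i1] - h[j * w + i0]) / ((float)(i1 - i0) * dx);
            float dzdy = (h[j1 * w + i] - h[j0 * w + i]) / ((float)(j1 - j0) * dy);
            float *n = nrm + 3 * (j * w + i);
            n[0] = -dzdx;
            n[1] = -dzdy;
            n[2] = 1.0f;
        }
    }
}
```

Compared with averaging facet normals, this does two subtractions and two scalings per vertex instead of several cross products and an accumulation pass.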
I calculate normals by averaging them: I calculate the normal of each facet, add it to the normal of each of its vertices, and then divide each vertex normal by the number of facets that contributed to it. Why shouldn't it be done like that for a water surface?
You can definitely get huge gains (a +100% speed-up). Be certain to control your compiler settings well. If you use C intrinsics, read through the generated asm output files, learn the timings from the AMD/Intel docs, and check whether your benchmarks (use a profiler, either external or embedded in your code with rdtsc or GetTickCount) show the expected timings.

Q: do you use intrinsics or inline asm?

In my experience with 3DNow!, it is very rare, even in the less favorable cases (like cross products or quaternion multiplications), that I don't get a +50% speed-up. Be certain to avoid anomalies due to bad C/C++ inlining, bad data alignment, function call overhead, etc. That way you rule out one common artificial dead end: you can be sure you are comparing quality (truly optimized) code in x87 against 3DNow! or SSE.

Beyond that, I leave the prior algorithmic analysis to you. Still, I'd say you should see considerable performance gains:

a) Cheap renormalization: the cheap (1-cycle throughput) rsqrt, without Newton-Raphson refinement, gives you enough precision for the kind of output you have: RGBA with 8 bits per component in the end.

b) A regular grid brings increased parallelism. It gives you an implicit adjacency graph (mesh) and implicit x, y coordinates. You can exploit this structure to compute several (2, 4 or 8) rows at once, or else use loop unrolling. Be certain to use the register space fully, to schedule your operations well, and preferably to use a structure-of-arrays layout. A regular grid has constant steps in x and y, assuming that only the z component changes in your surface animation model. Hence the most obvious optimization: do not recompute at run time elements that are constant.
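Points a) and b) can be sketched together. The following is only an illustration under stated assumptions: heights are stored one row per array (structure of arrays), the factors 1/(2*dx) and 1/(2*dy) are computed once outside the loop, four normals are produced per iteration, and the renormalization uses the approximate reciprocal square root with no Newton-Raphson step. It uses SSE (`_mm_rsqrt_ps`) in place of the 3DNow! PFRSQRT instruction, and the name `row_normals_soa` and its parameters are hypothetical.

```c
#include <assert.h>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#include <math.h>
#define HAVE_SSE 0
#endif

/* Hypothetical helper: normals for the interior vertices (1 .. w-2) of one row.
 * z_prev, z_mid, z_next: heights of rows j-1, j, j+1 (w floats each).
 * inv2dx, inv2dy: 1/(2*dx) and 1/(2*dy), constant for a regular grid.
 * nx, ny, nz: separate output arrays (structure of arrays), w floats each. */
void row_normals_soa(const float *z_prev, const float *z_mid, const float *z_next,
                     int w, float inv2dx, float inv2dy,
                     float *nx, float *ny, float *nz)
{
    assert(w >= 3);
    int i = 1;
#if HAVE_SSE
    const __m128 kx = _mm_set1_ps(inv2dx);   /* hoisted constants */
    const __m128 ky = _mm_set1_ps(inv2dy);
    const __m128 one = _mm_set1_ps(1.0f);
    for (; i + 4 <= w - 1; i += 4) {         /* four vertices per iteration */
        __m128 x = _mm_mul_ps(_mm_sub_ps(_mm_loadu_ps(z_mid + i - 1),
                                         _mm_loadu_ps(z_mid + i + 1)), kx);
        __m128 y = _mm_mul_ps(_mm_sub_ps(_mm_loadu_ps(z_prev + i),
                                         _mm_loadu_ps(z_next + i)), ky);
        __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)), one);
        __m128 inv = _mm_rsqrt_ps(len2);     /* approximate 1/sqrt, unrefined */
        _mm_storeu_ps(nx + i, _mm_mul_ps(x, inv));
        _mm_storeu_ps(ny + i, _mm_mul_ps(y, inv));
        _mm_storeu_ps(nz + i, inv);          /* z component is 1 * inv */
    }
#endif
    for (; i < w - 1; ++i) {                 /* scalar tail (or full fallback) */
        float x = (z_mid[i - 1] - z_mid[i + 1]) * inv2dx;
        float y = (z_prev[i] - z_next[i]) * inv2dy;
#if HAVE_SSE
        float inv = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x * x + y * y + 1.0f)));
#else
        float inv = 1.0f / sqrtf(x * x + y * y + 1.0f);
#endif
        nx[i] = x * inv;
        ny[i] = y * inv;
        nz[i] = inv;
    }
}
```

The SSE estimate is documented with a relative error of at most 1.5 * 2^-12 (about 0.04%), which is well below one step of an 8-bit channel (1/255, about 0.4%).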

If you want more tips, post your current code, preferably starting with the naive x87 version.
"Coding math tricks in asm is more fun than Java"

This topic is closed to new replies.
