Optimizing a divide

24 comments, last by Crispy 19 years, 10 months ago
It's false: unless you still work on a 486, or unless you use SIMD, integer-unit ops (muls) are not faster than fpu ops (muls). They are even far slower on most machines.
"Coding math tricks in asm is more fun than Java"
quote:Original post by Anonymous Poster
(AP #3)
You could also create your own fixed point type rather than floating point. In essence it's just a long that always uses the bottom 8 bits for the fractional part, so it's a lot faster than a float.


It overflows a lot faster than a float too. It's really not that huge a performance difference to stick with floats on modern processors. The SIMD multiplication opcodes have a latency of 6 or 7 and a throughput of 2; fmul has a latency of 7 or 8 with a throughput of 2. The floating point divides have a latency around 30+.

The general purpose MUL instruction runs with a latency of anywhere from about 10-18, and DIV takes 66-80. FP operations can also be computed side by side with ALU operations.
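For reference, the fixed-point type the anonymous poster describes might be sketched like this (all names are hypothetical; a 64-bit intermediate is used in the multiply and divide to dodge the overflow problem mentioned above):

```cpp
#include <cstdint>

// 24.8 fixed point: the low 8 bits hold the fractional part,
// the rest holds the integer part, as the poster describes.
typedef int32_t fixed;

inline fixed to_fixed(float f)  { return (fixed)(f * 256.0f); }
inline float to_float(fixed x)  { return x / 256.0f; }

// The raw product of two 24.8 numbers has 16 fractional bits,
// so shift back by 8 afterward.
inline fixed fx_mul(fixed a, fixed b)
{
    return (fixed)(((int64_t)a * b) >> 8);
}

// Pre-shift the numerator by 8 so the quotient keeps its fraction.
inline fixed fx_div(fixed a, fixed b)
{
    return (fixed)(((int64_t)a << 8) / b);
}
```

With only 24 integer bits, values overflow far sooner than a float's range, which is exactly the drawback raised in the reply above.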
quote:Original post by Charles B
It's false: unless you still work on a 486, or unless you use SIMD, integer-unit ops (muls) are not faster than fpu ops (muls). They are even far slower on most machines.


well, i didn't want to be that rude :D
quote:
Only 10 megaflops !!! LOL Hopefully even a K6 could do much more. This tastes like benchmarks in heavily unoptimized debug mode. I would say that the max speed is something like 16 gigaflops "div by constant = mul" per second on a 2GHz SSE-capable machine.


Actually it was in debug mode, added as a test in the middle of my *RENDERING LOOP* on another program.

I was just trying to illustrate the point that he's worrying about something he shouldn't be worrying about.

---------------------------Hello, and Welcome to some arbitrary temporal location in the space-time continuum.

If you need to constantly divide by these numbers, then convert them to reciprocals at compile time and use fp multiplication. However, to be fast, you don't want to keep loading these constants from memory. So make your own multiplication routine, and don't pop the fpu stack when you store the result back to memory:
//global:
float denominators[4] = {1.0f / 11025.0f, //Let compiler store fp
                         1.0f / 22050.0f, //for you.
                         1.0f / 44100.0f,
                         1.0f / 96000.0f};

//Called at beginning of some segment of code that needs to
//divide by given constants a lot.  Inlined b/c it's only called
//once, ideally.
inline void LoadConstantsOnFPUStack(void)
{
   _asm {
      fld denominators[0]
      fld denominators[1]
      fld denominators[2]
      fld denominators[3]
   }
}

But since the x87 fpu stack has only 8 registers (r0 to r7), this may lead to a problem if other routines need to use the fpu. Solution: make the routines in need save the fpu state and then restore it afterward:
//global:
char FPUState[108];

//These two functions are called if something else
//needs more than 4 fpu registers.
inline void SaveFPURegisters(void)
{
   _asm {
      fnsave FPUState
   }
}

inline void RestoreFPURegisters(void)
{
   _asm {
      frstor FPUState
   }
}

However, if another routine doesn't need more than 4 fpu registers, make sure the callee procedure balances the stack back, so that the top-of-stack pointer ends up back where the reciprocal of 96000 is. But since it's ambiguous whether C++ does this implicitly, it's probably a safer idea to save/restore the fpu state.
Now to multiply by your reciprocated denominators you could do this:
template<class T>
//For 96000, which is at st(0).
inline T multiply(T value)
{
   _asm {
      fld value  //Value pushed onto stack, aliased to st(0).
                 //96000 goes to st(1).
      fmul st(1) //st(0) = st(0) * st(1).
      fstp value //Put st(0) in value and pop the stack back
                 //to its original state.
   }
   return value;
}

//And then do 3 more templated routines, one each for 11025, 22050,
//and 44100, by replacing the line "fmul st(1)" with "fmul st(i)",
//where i is 4, 3, and 2, respectively.

However, you said you wanted fast? Well in that case you'd have to write your own fpu code if you wanted to do several consecutive divisions by those constants. If you have to do several divisions at once, even by different numbers, use SSE and the xmm registers. The IA-32 SSE instructions are valid on any Pentium III and later processor.
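A minimal sketch of that SSE suggestion, using the compiler intrinsics from <xmmintrin.h> rather than raw asm (the function name is mine; the reciprocal constants are the ones from the earlier post):

```cpp
#include <xmmintrin.h>

// Divide four values by the four sample-rate constants at once:
// one packed multiply against precomputed reciprocals.
void divide_four(float* values)
{
    static const float inv[4] = { 1.0f / 11025.0f, 1.0f / 22050.0f,
                                  1.0f / 44100.0f, 1.0f / 96000.0f };
    __m128 v = _mm_loadu_ps(values);         // load 4 floats
    __m128 r = _mm_loadu_ps(inv);            // load 4 reciprocals
    _mm_storeu_ps(values, _mm_mul_ps(v, r)); // one mulps, store back
}
```

Note the results differ from true division in the last bit or so, since each reciprocal is itself rounded, which is usually acceptable for audio scaling like this.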
"I study differential and integral calculus in my spare time." -- Karl Marx
Well thx for the effort, temp. But loading the st(0-3) registers is really not a good idea.

- I suppose he only needs one constant per loop, not 4.

- fop reg, mem is as speedy as fop reg, reg, especially if mem is in the L1 cache, which is obviously the case for a constant loaded repeatedly. The only drawback may be code size, which can marginally make the code a bit slower.

- forcing constants into registers raises register pressure, already high with only 8 registers, and reduces scheduling possibilities for the compiler or the human asm code writer

- last, he said he does not know asm. Thus it won't be very useful anyway.

- as for the template/asm stuff: I could explain in detail why it would lead to less efficient code than completely C++ code, especially if the loops are unrolled. You freeze the names of the FPU registers (st(0), then st(1) once value is loaded). Thus you prevent two input values (one in st(1) and one in st(0)) from being processed in parallel. Only the hardware's code reordering makes it not too dramatic. But on a plain Pentium, for instance, the code would be very poor: each instruction would wait out the max latency of the previous.

More generally, a rule of thumb: inline asm is pointless with Visual C++, because Visual C++ does not rename the registers it finds inside asm blocks; it never touches them. The only occasions where __asm{} is relevant in Visual C++ are routines that are big enough, for example:
- a matrix*matrix multiplication
- an array processing loop
- never for such a small function as this "multiply".

BTW GCC is far more efficient than Visual for low level coding, mixing C++ and asm.

So once again : let the compiler work here unless you are able to write the complete unrolled and scheduled loop in asm. The first thing is knowing the optimization options of the compiler.
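To illustrate that advice, here is the "let the compiler work" version of the reciprocal trick in plain C++ (a sketch; the function name is mine). The compiler is free to keep the constant in a register, unroll, and schedule the loop itself:

```cpp
// Divide a buffer of samples by a constant rate: do the one real
// divide outside the loop, then only cheap multiplies inside it.
void divide_by_rate(float* samples, int n, float rate)
{
    const float inv = 1.0f / rate;  // hoisted reciprocal
    for (int i = 0; i < n; ++i)
        samples[i] *= inv;          // mul instead of div per element
}
```

With optimizations on (and SSE enabled), a compiler can vectorize this loop on its own, without any hand-written asm.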



"Coding math tricks in asm is more fun than Java"

