Optimizing a divide


Since I know very little about the internal workings of CPU architectures, but am nevertheless facing a very old problem, I decided to ask around a little. I need to implement divides that are as fast as possible for the denominators 11025, 22050, 44100 and 96000. If anyone can post relevant IEEE-floating-point-compliant code (which I can't write myself), that'd be best, but I fully realize it's a little arrogant of me to expect that. However, since I'm also a wee bit lazy (to write a benchmark) and don't have access to many different CPU types/builds, I'll also pop the simpler question here: which is faster - a division by a number, or a multiplication by the precalculated constant of one divided by that number? And what implications might this choice have on different CPU architectures (I only have access to an Athlon XP)?
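
For what it's worth, here is a minimal timing sketch of the multiply-vs-divide comparison being asked about, assuming a C++ compiler with <chrono>; the numbers depend heavily on compiler flags, since an optimizer may rewrite the division (or the whole loop) on its own:

// Timing sketch: per-sample divide vs. multiply by a precalculated reciprocal.
// Build with and without optimizations and compare the two loops.
#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1000000;
    std::vector<float> samples(n, 0.5f);
    volatile float sink = 0.0f;            // keeps the loops from being optimized away

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i)
        sink = sink + samples[i] / 44100.0f;           // per-sample divide
    auto t1 = std::chrono::steady_clock::now();

    const float inv = 1.0f / 44100.0f;                 // precalculated reciprocal
    for (int i = 0; i < n; ++i)
        sink = sink + samples[i] * inv;                // per-sample multiply
    auto t2 = std::chrono::steady_clock::now();

    std::printf("divide:   %lld us\n", (long long)
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    std::printf("multiply: %lld us\n", (long long)
        std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
    return 0;
}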

Guest Anonymous Poster
Find a problem before you try to solve it. Seriously.

If you have X hours to spend on optimizing your application, they're best spent on the bottlenecks. If the division shows up as a bottleneck when you run your application through a profiler, then you can start looking into this.

Guest Anonymous Poster
quote:
Original post by Crispy
[...]


To answer the second question: multiplication is generally faster than division. The truth is that if you're just compiling code normally, the compiler knows what optimisations can be undertaken and will generally apply them where acceptable; case in point, the MSVC 7 compiler definitely spots and acts upon opportunities to change divisions into multiplications by the inverse - I don't know about other compilers, but I'd be surprised if any current major compiler doesn't do it (besides, MSVC 7 can now be got for free). As a general rule, keep your code as clean as possible and hope your compiler will do the hard work for you; only when you've established, through profiling, that it can't spot an optimisation you actually need should you rewrite the relevant line or two. What happens next year when a subtle change in architecture makes the optimisation a hindrance rather than a help? (It happens...)

As for the first part of the question, what exactly are you after? Integer code to handle the FP ops? To be honest, I'd suggest the rather traditional
c = a / b; 
It's 100% standards compliant, it gets translated to exactly one opcode on any architecture, and integer code to do the work is unlikely to be faster. What's wrong with it? Use floats if doubles are too slow for your purposes...

Why not mention what you're actually trying to do? Perhaps a better method overall can be suggested.
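
A quick way to see what a particular compiler actually does with a divide-by-constant is to ask for an assembly listing. A small sketch, assuming MSVC or GCC command lines; note that under strict IEEE settings the compiler may keep the divide (1/44100 is not exactly representable) and only switch to a multiply in a fast floating-point mode:

// div_check.cpp -- inspect the generated code:
//   MSVC:  cl /O2 /FAs div_check.cpp    (writes div_check.asm)
//   GCC:   g++ -O2 -S div_check.cpp     (writes div_check.s)
// Look for whether the divide survives or becomes a multiply by the reciprocal.
float scale(float sample)
{
    return sample / 44100.0f;
}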

AP1: I've been coding for 6-7 years now and I rarely post a stupid question related to programming itself. This is not such a case - I've seen hundreds of (okay, that's a gross overstatement) threads similar to this one, all of which get similar replies. I am perfectly aware of the unwritten code of conduct (pun intended) related to optimizations. However, I also mentioned that I hardly know anything about CPU architecture, which suggests I know too little Assembly (which is the case) to write the optimizations myself even where I recognize that one truly is needed.

The numbers 11025, 22050, 44100 and 96000 are standard sampling rates for audio signals, so a little deductive thinking would suggest I'm dealing with a substantial number of divisions as opposed to some negligible few per second (in my case it can be up to 200-300 thousand per second, depending on the number of distinct signals).

I also mentioned the IEEE floating point standard, which also mildly suggests that I'm dealing with single precision floating point operations, not integer or double precision floating point operations.

The reason I asked is that I am not sure whether introducing a 1/n multiplicand would pay off compared to the simpler use of a division, if the division can be accomplished just as fast on most CPUs (to me this would translate to one instruction or something like that). Thanks for answering that, AP2.

As for suggesting a better method - I don't see a way to speed up signal normalization over a one-second period other than dividing all the samples in it by the sampling rate.

Guest Anonymous Poster
A quick look on the internet suggests that the latency of an FDIV instruction (i.e. the FPU's divide instruction) is 17 cycles on a P6 (FMUL = 5). Assuming you've got 300,000 of them, you're talking about 5.1 million cycles devoted to division out of what - say 1000 million (i.e. a GHz)? That's about 0.51% of CPU resources. It's simply not worth worrying about, unless you've written the program and established that this really is an issue.

The other thing worth noting is that the above paragraph is complete bollocks, in that current architectures can significantly mitigate this cost through pipelining - hence the numbers mean very little except perhaps as an upper bound. If you're dealing with large volumes of data, your bottleneck is very probably I/O.

If you're really only dividing by 11025, 22050, 44100 and 96000, then why not:


static const float oneOver11025 = 1.0f/11025.0f;
static const float oneOver22050 = 1.0f/22050.0f;
static const float oneOver44100 = 1.0f/44100.0f;
static const float oneOver96000 = 1.0f/96000.0f;

float newNumber = originalNumber * oneOver11025;
//etc, etc, etc...



But I also agree with all the other people: unless a profiler has told you that this particular function/division is really _the_ thing that is slowing down your app, you are wasting your time optimising it.

-me

You should consider using SSE and 3DNow. These are meant for streaming work like this.

With SSE you can handle 4 samples at once (4 reciprocal multiplications per instruction).

And if you're really after speed, you should consider prefetching and non-temporal stores with SSE & 3DNow. This will boost your code.
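
A rough sketch of what that looks like with the SSE intrinsics, assuming <xmmintrin.h>, a buffer length that is a multiple of 4, and purely illustrative names:

#include <xmmintrin.h>

void scale_samples_sse(float* samples, int count, float sampleRate)
{
    const __m128 invRate = _mm_set1_ps(1.0f / sampleRate);  // broadcast the reciprocal
    for (int i = 0; i < count; i += 4)
    {
        __m128 s = _mm_loadu_ps(samples + i);  // load 4 samples (unaligned load)
        s = _mm_mul_ps(s, invRate);            // 4 reciprocal multiplications at once
        _mm_storeu_ps(samples + i, s);         // store them back
        // _mm_stream_ps(samples + i, s) would be the non-temporal store variant,
        // useful when the results won't be read again soon.
    }
}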

Guest Anonymous Poster
quote:
Original post by Crispy
[...]

Hello, I'm the "AP1". Too many APs here, I think.

As AP2 said, compilers are great today. It's rare that you can do something smarter than the compiler in the general case.

I didn't mean to offend you - sorry if I did. Still, I think my point is valid: optimize after you have found the bottlenecks, not before. Even in the case you describe here, I doubt you'll see it in the profiler; you will probably have other bottlenecks instead. Spend the time on what you see in the profiler. There is a ton of literature written about this - it's not just me having this opinion.

Guest Anonymous Poster
For what purpose do you want to divide all the samples by the sampling rate?

Guest Anonymous Poster
quote:
Original post by Anonymous Poster
As AP2 said, compilers are great today. It's rare that you can do something smarter than the compiler in the general case.


AP2 again! To qualify that: compilers today do a great job of micro-optimisations, which is what most people on this board tend to think of when they say 'optimise'; what they can't do is figure out a better algorithm for you (yet!). Keep the code clean, let the compiler do the nasty bits for you, and try to think of/spot a better way to do what you want. At worst you'll end up with clear code that anyone can read, which is therefore better for maintenance.

Guest Anonymous Poster
quote:
Original post by DrGUI
Can you use .NET with SSE and 3DNow, now that asm{} has been removed?


AFAIK you can't use them explicitly, I think that (in theory) the JIT compiler is meant to spot opportunities for vectorisation (well, if it does happen, then that is where it will be).

Palidine posted the answer. Instead of dividing by a constant, you can multiply by its reciprocal. It can't really get any faster than that.

But as the first AP said, you should really profile your code. Chances are that that division isn't slowing it down at all. Is it within your innermost loop? Is it really that important to perform so many of those iterations per second, or are you getting the data from a source that provides it at a slow rate? Even if your compiler doesn't do the optimization for you, with pipelining there could be very little effect on the overall performance.

quote:
Original post by Anonymous Poster
For what purpose do you want to divide all the samples by the sampling rate?


The number is actually greater than that (300,000) on some occasions - it's a gross estimate I came up with - and it's for spectral analysis.

Matei: I suggested it in the OP myself as well...

Etnu: I'm aiming for CPU usage of less than 1% (currently I'm at 3-6%) on a 1GHz Athlon. These divides aren't the only bottleneck I know of, but they do constitute one of the more major ones.

For that reason the AP who said that 0.51% CPU usage is negligible is (in this case at least) wrong - I can't have one loop alone suck up more than half of the resources dedicated to my program.

Guest Anonymous Poster
quote:
Original post by Crispy
[...]


You didn't answer the question, though - many of us are curious as to why this is necessary. One usually multiplies/divides samples by a volume coefficient, sure, but by the sampling rate? What gives?

If you are dividing by a constant then I would expect even a poor compiler to replace it with multiplication by the correct constant.

I could be wrong. Anyway, read the generated code and profile before making any 'optimisations'.

"1,000,000 floating point divide operations, on a Athlon XP 2000+ takes < 100ms"

Only 10 mega flops !!! LOL Hopefully even a K6 could do much more. This tastes like benchmarks in heavilly unoptimized debug mode. I would say that the max speed is something like 16 giga flops "div by constant=mul" per second on a 2GHz SSE capable machine.

Crispy in such cases, you should have written a few lines of code samples to make it clearer.

But since I assume you are talking of floating point data (if these are integers just tell, it's fundamental) and I am certain all you do is scaling an array, the division is not really the issue :

Don't worry if your code is just :

for(){ r[ i ] = s[ i ]/scale }.


The compiler will optimize everything in release mode. But to be certain, you can unroll the loop and cache the reciprocal yourself:


r[i+0] = s[i+0]*inv_scale;
r[i+1] = s[i+1]*inv_scale;
r[i+2] = s[i+2]*inv_scale;
r[i+3] = s[i+3]*inv_scale;


The next step: if your compiler has vectorizing capabilities, it will be obvious for it to output SSE or 3DNow code. And if you want to be certain of optimal speed, you can still use the vector intrinsics in Visual C++. The equivalent also exists for GCC.

Last, you could wait for the release of my math lib. Your problem will then just be one function call to one of my myriad of ultimately optimized functions:
div(pOut, n, pSrc, scale);



Guest Anonymous Poster
(AP #3)
You could also create your own fixed point type rather than floating point. In essence it's just a long that always uses the bottom 8 bits for the fractional part, so it's a lot faster than a float. It's easily sorted with a few macros (in C):


typedef signed long fixed;

#define mulfx(x,y) (((x)*(y))>>8)   // multiply x and y

#define divfx(x,y) (((x)<<8)/(y))   // divide x by y

#define itof(x)    ((x)<<8)         // convert an int to a fixed

#define ftoi(x)    ((x)>>8)         // convert a fixed to an int


If you need greater precision, change the 8s to something larger. You can write a class to handle all this for you in C++, but I haven't needed one yet.
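
A short usage sketch of those macros, assuming the 8-bit fractional layout described above (values chosen purely for illustration):

#include <cstdio>

typedef signed long fixed;

#define mulfx(x,y) (((x)*(y))>>8)   // multiply two fixed values; (x)*(y) can overflow for large inputs
#define divfx(x,y) (((x)<<8)/(y))   // divide x by y
#define itof(x)    ((x)<<8)         // int -> fixed
#define ftoi(x)    ((x)>>8)         // fixed -> int

int main()
{
    fixed a = itof(3);               // 3.0 -> raw value 768
    fixed half = itof(1) / 2;        // 0.5 -> raw value 128
    fixed product = mulfx(a, half);  // 1.5 -> raw value 384

    std::printf("raw = %ld, integer part = %ld\n", product, ftoi(product));
    return 0;
}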

The claim that fixed point is a lot faster than floats is false: unless you still work on a 486, or unless you use SIMD, integer unit (IU) muls are not faster than FPU muls. They are even far slower on most machines.

quote:
Original post by Anonymous Poster
(AP #3)
You could also create your own fixed point type rather than floating point. In essence it's just a long that always uses the bottom 8 bits for the fractional part, so it's a lot faster than a float.


It overflows a lot faster than a float too. It's really not that huge of a performance difference to stick with floats on modern processors. The SIMD multiplication opcodes have a latency of 6 or 7 cycles and a throughput of 2; FMUL has a latency of 7 or 8 with a throughput of 2. The floating-point divides have a latency of around 30+.

The general-purpose integer MUL instruction runs with a latency of anywhere from about 10-18 cycles and DIV takes 66-80. FP operations can also be computed side by side with ALU operations.

quote:
Original post by Charles B
The claim that fixed point is a lot faster than floats is false: unless you still work on a 486, or unless you use SIMD, integer unit (IU) muls are not faster than FPU muls. They are even far slower on most machines.


Well, I didn't want to be that rude :D

quote:

Only 10 megaflops!!! LOL. Hopefully even a K6 could do much more. This tastes like a benchmark in heavily unoptimized debug mode. I would say the max speed is something like 16 gigaflops of "div by constant = mul" per second on a 2GHz SSE-capable machine.



Actually it was in debug mode, added as a test in the middle of my *RENDERING LOOP* on another program.

I was just trying to illustrate the point that he's worrying about something he shouldn't be worrying about.

If you need to constantly divide by these numbers, then convert them to reciprocals at compile time and use FP multiplication. However, to be fast, you don't want to keep loading these constants from memory, so make your own multiplication routine and keep the constants resident on the FPU stack rather than popping them when you store the result back to memory:

//global:
float denominators[4] = {1.0f / 11025.0f,  //Let the compiler store the
                         1.0f / 22050.0f,  //reciprocals for you.
                         1.0f / 44100.0f,
                         1.0f / 96000.0f};

//Called at the beginning of some segment of code that needs to
//divide by the given constants a lot. Inlined because it's only
//called once, ideally. Note that MSVC inline-asm array offsets are
//in bytes, so consecutive floats are 4 bytes apart.
inline void LoadConstantsOnFPUStack(void)
{
    _asm {
        fld denominators[0]    // 1/11025, pushed first
        fld denominators[4]    // 1/22050
        fld denominators[8]    // 1/44100
        fld denominators[12]   // 1/96000, ends up at st(0)
    }
}

But since the x87 FPU stack has only 8 registers (R0 to R7), this may lead to a problem if other routines need to use the FPU. Solution: have the routines in question save the FPU state and then restore it afterwards:

//global:
char FPUState[108];   //FNSAVE/FRSTOR use a 108-byte save area in 32-bit mode

//These two functions are called if something else
//needs more than 4 FPU registers.
inline void SaveFPURegisters(void)
{
    _asm {
        fnsave FPUState   //save the full FPU state (and reinitialize the FPU)
    }
}

inline void RestoreFPURegisters(void)
{
    _asm {
        frstor FPUState   //restore the saved FPU state
    }
}

However, if another routine doesn't need more than 4 FPU registers, make sure the callee balances the FPU stack before returning, so that the top-of-stack pointer ends up back where the 1/96000 constant is. But since it's ambiguous whether C++ does this implicitly, it's probably safer to save/restore the FPU state.
Now, to multiply by your reciprocal denominators you could do this:

//For dividing by 96000: its reciprocal is at st(0) before the call.
template<class T>
inline T multiply(T value)
{
    _asm {
        fld value            //Value pushed onto the stack as st(0);
                             //1/96000 is now at st(1).
        fmul st(0), st(1)    //st(0) = st(0) * st(1).
        fstp value           //Store st(0) to value and pop, restoring
                             //the stack to its original state.
    }
    return value;
}
Then write 3 more templated routines, one each for 11025, 22050 and 44100, by replacing the line
"fmul st(0), st(1)"
with
"fmul st(0), st(i)"
where i is 4, 3, and 2, respectively (matching where each reciprocal sits on the FPU stack after the fld).

However, you said you wanted fast? Well, in that case you'd have to write your own FPU code if you wanted to do several consecutive divisions by those constants. If you have to do several divisions at once, even by different numbers, use SSE and the XMM registers. The IA-32 SSE instructions are valid on any Pentium III and later processor.
