#### Archived

This topic is now archived and is closed to further replies.

# Optimizing a divide


## Recommended Posts

Since I know very little about the internal workings of CPU architecture, but am nevertheless facing a very old problem, I decided to ask around a little. I need to implement divides that are as fast as possible for the denominators 11025, 22050, 44100 and 96000. If anyone can post relevant IEEE floating point standard compliant code (which I can't write myself), that would be best, but I fully realize it's a little arrogant of me to expect that. However, since I'm also a wee bit lazy (to write a benchmark) and don't have access to many different CPU types/builds, I'll also pop the simpler question here: which is faster - a division by a number, or a multiplication by the precalculated constant of one divided by that number? Also, what implications might this choice have on different CPU architectures (I only have access to an Athlon XP)?

##### Share on other sites
Find a problem before you try to solve it. Seriously.

If you have X hours to spend on optimizing your application, it's best to spend those hours on the bottlenecks. If the division turns up as a bottleneck when you run your application under a profiler, then you can start looking into this.

##### Share on other sites
quote:
Original post by Crispy
Which is faster - a division by a number or a multiplication by the precalculated constant of one divided by that number? Also, what implications might this choice have on different CPU architectures (I only have access to AthlonXP)?

To answer the second question: multiplication is essentially always faster than division. The truth is that if you're just compiling code normally, the compiler knows which optimisations can be undertaken and will generally apply them when they are likely to be acceptable; case in point, the MSVC 7 compiler definitely spots and acts upon opportunities to change divisions into multiplications by the reciprocal. I don't know about other compilers, but I'd be surprised if any current major compiler didn't do it (besides, MSVC 7 can now be had for free). As a general rule, keep your code as clean as possible and hope your compiler will do the hard work for you; only when you've established, through profiling, that it can't spot an optimisation you actually need should you rewrite the relevant line or two. What happens next year when a subtle change in architecture makes the optimisation a hindrance rather than a help? (It happens...)

In the case of the first part of the question, what exactly are you after - integer code to handle the FP ops? To be honest, I'd suggest the rather traditional

c = a / b;

It's 100% standards compliant, gets translated to exactly one opcode on any architecture, and integer code to do the work is unlikely to be faster. What's wrong with it? Use floats if doubles are too slow for your purposes...

Why not mention what you're actually trying to do? Perhaps a better method overall can be suggested.

##### Share on other sites
AP1: I've been coding for 6-7 years now and I rarely post a stupid question related to programming itself. This is not such a case - I've seen hundreds of (okay, that's a gross overstatement) threads similar to this one, all of which get similar replies. I am perfectly aware of the unwritten code of conduct (pun intended) surrounding optimizations. However, I also mentioned that I know hardly anything about CPU architecture, which suggests I know too little assembly (which is the case) to write the optimizations myself where I recognize that one truly is needed.

The numbers 11025, 22050, 44100 and 96000 are standard sampling rates for audio signals, so a little deductive thinking would suggest I'm dealing with a substantial number of divisions, as opposed to a negligible few per second (in my case it can be up to 200-300 thousand per second, depending on the number of distinct signals).

I also mentioned the IEEE floating point standard, which mildly suggests that I'm dealing with single precision floating point operations, not integer or double precision floating point operations.

The reason I asked is that I wasn't sure whether introducing a 1/n multiplicand would outweigh the simpler use of a division, if the division can be accomplished just as fast on most CPUs (to me this would translate to one instruction or something like that). Thanks for answering that, AP2.

As for suggesting a better method - I don't see a way to speed up signal normalization for a second's period through means other than dividing all the samples in it by the sampling rate.

##### Share on other sites
1,000,000 floating point divide operations on an Athlon XP 2000+ take < 100 ms.

##### Share on other sites
A quick look on the internet suggests that the latency of an FDIV instruction (i.e. the FPU's divide instruction) is 17 cycles on a P6 (FMUL = 5). Assuming you've got 300,000 of them, you're talking about 5.1 million cycles devoted to division out of, what, 1000 million (i.e. a GHz)? Say 0.51% of CPU resources? It's simply not worth worrying about, unless you've written the program and established that this really is an issue.

The other thing worth noting is that the above paragraph is complete bollocks, in that current architectures can significantly mitigate this cost through pipelining - hence the numbers mean very little except perhaps as an upper bound. If you're dealing with large volumes of data, your bottleneck is very probably I/O.

##### Share on other sites
if you're really only dividing by 11025, 22050, 44100 and 96000, then why not:

static const float oneOver11025 = 1.0f / 11025.0f;
static const float oneOver22050 = 1.0f / 22050.0f;
static const float oneOver44100 = 1.0f / 44100.0f;
static const float oneOver96000 = 1.0f / 96000.0f;

float newNumber = originalNumber * oneOver11025;
// etc, etc, etc...

but i also agree with all the other people. unless a profiler has told you that this particular function/division really is _the_ thing slowing down your app, you are wasting your time optimising it.

-me

##### Share on other sites
You should consider using SSE and 3DNow!. These instruction sets are meant for streaming workloads like this one.

In the case of SSE you can handle four samples at once (four reciprocal multiplications per instruction).

And if you're really after speed, you should also consider prefetching and non-temporal stores with SSE and 3DNow!.

##### Share on other sites
quote:
Original post by Crispy
AP1: I am perfectly aware of the unwritten code of conduct (pun intended) related to optimizations.

Hello, I'm the "AP1". Too many APs here, I think.

As AP2 said, compilers are great today. It's rare that you can do something smarter than the compiler in the general case.

I didn't mean to offend you - sorry if I did. Still, I think my point is valid: optimize after you have found the bottlenecks, not before. Even in the case you describe here, I doubt you'll see it in the profiler; you will probably have other bottlenecks instead. Spend the time on what you see in the profiler. There are tons of literature written about this - it's not just me holding this opinion.

##### Share on other sites
For what purpose do you want to divide all the samples by the sampling rate?
