# Let's build a resource for optimisations?


## Recommended Posts

There's a whole bunch of standard questions about optimising various things - sin()/cos(), sqrt(), float->int conversion etc. which keep coming up here. So I thought maybe people could contribute to this thread - I'll keep updating it and adding all the 'approved' techniques to this first post. I'm thinking C/C++, obviously some things are compiler specific which is OK, just be clear - and also explain the accuracy of fast but inexact methods. Here's a quick list of some techniques I KNOW there are answers for - you supply/discuss them and I'll compile/format/organise them:

## A faster Square Root than sqrt()

### Exact

awaiting submission...

### Non-exact

awaiting submission...
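
Until a proper submission arrives, here is a sketch of one widely circulated candidate: the 0x5f3759df inverse-square-root bit trick (popularized by Quake III), refined with one Newton-Raphson step; sqrt(x) then falls out as x * rsqrt(x). This assumes 32-bit IEEE-754 floats and trades accuracy (roughly 0.2% maximum relative error) for speed - measure it in context before relying on it:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Approximate 1/sqrt(x) via the well-known 0x5f3759df bit trick,
// refined with one Newton-Raphson step. Assumes 32-bit IEEE-754
// floats and x > 0; max relative error is roughly 0.2%.
static float fast_rsqrt(float x)
{
    const float half = 0.5f * x;
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof(i));     // reinterpret bits without UB
    i = 0x5f3759df - (i >> 1);          // magic-number initial guess
    std::memcpy(&x, &i, sizeof(x));
    x = x * (1.5f - half * x * x);      // one Newton-Raphson iteration
    return x;
}

// sqrt(x) == x * (1/sqrt(x)), so the same trick yields a fast sqrt.
static float fast_sqrt_approx(float x)
{
    return x * fast_rsqrt(x);
}
```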

## Faster than sin()/cos() etc

awaiting submission...
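
As a placeholder until someone posts a vetted version, here is a minimal sketch of the Taylor-polynomial approach discussed later in this thread. It assumes the argument is already range-reduced to roughly [-pi/2, pi/2] (full-circle handling is omitted), and a production version would use minimax rather than raw Taylor coefficients:

```cpp
#include <cassert>
#include <cmath>

// 7th-order Taylor approximation of sin(x), accurate to roughly 2e-4
// on [-pi/2, pi/2]. Callers needing the full circle must range-reduce
// the argument first; that step is omitted from this sketch.
static double fast_sin(double x)
{
    const double x2 = x * x;
    // sin(x) ~= x - x^3/3! + x^5/5! - x^7/7!, evaluated via Horner's rule
    return x * (1.0 + x2 * (-1.0/6.0 + x2 * (1.0/120.0 - x2 / 5040.0)));
}
```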

## Fast float->int conversion

### Truncated

awaiting submission...

### Rounded

awaiting submission...

## Fast double->int conversion

### Truncated

awaiting submission...

### Rounded

awaiting submission...

##### Share on other sites
Unfortunately, a lot of the simple mathematical functions are either implemented in hardware (meaning it is very unlikely you will find a faster software method, e.g. sin/cos) or have trivial code replacements that the compiler can already apply (e.g. bit-shifting vs. multiply/divide).

In a practical sense, one way to make code that uses such operations faster is to avoid doing them at all. Rearranging/rewriting code so that the result is mathematically equivalent, but there are fewer operations, is the best way to go. (For example, referring to a recent discussion on matrix inverse in the math & physics board, there are many alternative ways to go about it that all mathematically result in the inverse, but have wildly different performance characteristics.)
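
A minimal illustration of this rewriting idea (the helper names here are hypothetical): a point-in-radius test gives the same answer whether you compare distances or squared distances, since both sides are non-negative, but the latter form drops the sqrt entirely:

```cpp
#include <cassert>
#include <cmath>

// Naive form: "is this offset within radius r?" - calls sqrt every time.
static bool within_radius_naive(double dx, double dy, double r)
{
    return std::sqrt(dx * dx + dy * dy) <= r;
}

// Mathematically equivalent rewrite: compare squared quantities.
// Same result for non-negative r, no sqrt needed.
static bool within_radius_fast(double dx, double dy, double r)
{
    return dx * dx + dy * dy <= r * r;
}
```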

##### Share on other sites
The hardware-implemented things are one area where there are gains to be made - I've seen a few threads talking about how the MSVC++ compiler turns sin(x) into a whole load of stuff surrounding the FSIN instruction.

That's one type of trick I'm thinking we can have here - inline ASM tricks to reduce overhead etc.
Another main one is algorithmic stuff - like an inexact sqrt replacement which can beat the FPU hard-wired instruction, or a fast cube-root trick.
And then there's stuff like SIMD techniques for fast memcpy/memset replacements and so on.
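
To illustrate the structure of such replacements without committing to a particular instruction set, here is a portable sketch that copies in 64-bit chunks with a byte-wise tail; a real SIMD version would use SSE/AVX (or NEON) registers plus alignment and cache-bypass handling, so treat this only as an outline of the idea:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Portable sketch of the "copy in wider units" idea behind SIMD memcpy
// replacements: move 8 bytes per iteration, then mop up the tail.
// Real implementations use vector registers and alignment handling.
static void wide_copy(void* dst, const void* src, std::size_t n)
{
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    while (n >= 8) {                    // bulk: one 64-bit chunk at a time
        std::uint64_t chunk;
        std::memcpy(&chunk, s, 8);
        std::memcpy(d, &chunk, 8);
        s += 8; d += 8; n -= 8;
    }
    while (n--) *d++ = *s++;            // tail: remaining bytes one by one
}
```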

##### Share on other sites
Or you could use the optimized versions already in the libraries. You know, they've seen the most action over the years, and compilers already know how to deal with them with extreme efficiency.

If you *really* want extreme math speeds, go with your architecture's specific optimized library (Intel (x86 and Itanium), AMD (x86-64), and IBM (PPC) all have them; I'm sure you can find a Freescale (PPC) one if you look). These companies pay people to squeeze the most performance out of their code for cluster computers, so that's really going to be the best way to go.

##### Share on other sites
Keep in mind that most optimizations are *not* about finding a faster sqrt (and if that were easy to do, the standard libraries would use it already). Usually, the best way to beat the performance of standard libraries is to compromise something else: if you can live with lower accuracy, you *might* gain a bit of performance, for example.
But the most powerful optimizations are usually more about restructuring your code in a more compiler-friendly way than about replacing math operations with faster versions.

##### Share on other sites
Yes but the point of this thread is to group the 'low-level' optimisations, not discuss whether they should be used.

Has anyone actually got any to contribute - I'm sure questions about this stuff used to be quite frequent....

##### Share on other sites
Good idea!

Quote:
 Or you could use the optimized versions already in the libraries. You know, they've seen the most action over the years, compilers already know how to deal with them with extreme efficiency.

Unfortunately this doesn't hold true in practice, due to a combination of: 1) implementor laziness, 2) needing to work on all CPUs/platforms, and 3) no knowledge of app-specific situations like relaxed precision requirements.

Quote:
 Yes but the point of this thread is to group the 'low-level' optimisations, not discuss whether they should be used. Has anyone actually got any to contribute - I'm sure questions about this stuff used to be quite frequent....

Talking about it is easier than standing and delivering :))

square root (integer, exact and inexact float): see Paul Hsieh's very thorough coverage.

sin/cos:
1) if you need both at the same time, use FSINCOS instruction (hard to beat)
2)
Quote:
 meaning it is very unlikely you will find a faster software method, i.e. sin/cos
Show me someone who cannot implement a decent Taylor series approximation that runs in less than the 200 clocks needed by FPU, and I'll show you an incompetent coder :)

float -> int (C truncate semantics):
#if USE_IA32_FLOAT_TO_INT	// implies HAVE_MS_ASM// notes:// - PTR is necessary because __declspec(naked) means the assembler//   cannot refer to parameter argument type to get it right.// - to conform with the fallback implementation (a C cast), we need to//   end up with truncate/"chop" rounding. subtracting does the trick,//   assuming RC is the IA-32 default round-to-nearest mode.static const float round_bias = 0.4999999f;__declspec(naked) i32 ia32_i32_from_float(float f){	UNUSED2(f);__asm{	push		eax	fld			DWORD PTR [esp+8]	fsub		[round_bias]	fistp		DWORD PTR [esp]	pop			eax	ret}}__declspec(naked) i32 ia32_i32_from_double(double d){	UNUSED2(d);__asm{	push		eax	fld			QWORD PTR [esp+8]	fsub		[round_bias]	fistp		DWORD PTR [esp]	pop			eax	ret}}__declspec(naked) i64 ia32_i64_from_double(double d){	UNUSED2(d);__asm{	push		edx	push		eax	fld			QWORD PTR [esp+12]	fsub		[round_bias]	fistp		QWORD PTR [esp]	pop			eax	pop			edx	ret}}#endif	// USE_IA32_FLOAT_TO_INT

Quote:
 And then there's stuff like SIMD techniques for fast memcpy/memset replacements and so on.


##### Share on other sites
Quote:
Original post by Jan Wassenberg
sin/cos:
1) if you need both at the same time, use FSINCOS instruction (hard to beat)
2)
Quote:
 meaning it is very unlikely you will find a faster software method, i.e. sin/cos
Show me someone who cannot implement a decent Taylor series approximation that runs in less than the 200 clocks needed by FPU, and I'll show you an incompetent coder :)

Well, let me look into that example a bit. You would sacrifice accuracy first, in favor of the simpler, faster instructions of the Taylor series. Certainly, you might be able to get it so that the best-case number of clock cycles for each instruction sums to less than the best-case clocks for the sin function. But those are raw numbers and not really representative of what is truly going on inside the processor.

For example, one big problem is considering implementation. If you implement it as a function, then you have function call overhead. If it's inline, then you have suddenly expanded the number of instructions in your program by a large margin (if you don't, then you aren't using sin() enough to make a difference anyway). So you get code bloat, using more CPU cache - and worse yet, what was previously one instruction (which obviously can only fit into at most one cache set) is now many, which could span two cache sets. So now you've nearly killed your set-associative L1 instruction cache, slowing down other parts of your program. And so on.

In the end, you lose accuracy and gain little (if anything). When you get down into micro-optimizations like this, the third-order effects matter greatly. We could spend hours discussing the effects of micro-ops fusion, superscalar/reordering instruction units, etc., on any small piece of code. And it will be completely different in 18 months when new processor cores come out. In your style, I would say, "show me someone who thinks counting raw clocks is a measure of performance, and I will show you a computer architect stuck in the '80s"

I apologize to the OP for bringing this thread OT. But I think it is entirely relevant to the discussion of micro-optimizations (which are not all bad, not at all)

##### Share on other sites
What people don't realize is that threads such as this one PRESUPPOSE the need to make use of any of these techniques. Can't we *for once* talk about this without someone saying "premature optimization = boo"?

Quote:
 Well, let me look into that example a bit. You would sacrifice accuracy first, in favor of the simpler, faster instructions of the taylor series. Certainly, you might be able to get it so that the best-case number of clock cycles for each instruction sums to less than the best-case clocks for the sin function. But those are raw numbers and not really representative of what is truly going on inside the processor.

Sure. While cycle count is only a guide, of course you need to measure the whole thing in context to determine gains.

Quote:
 For example, one big problem is considering implementation. If you implement it as a function, then you have function call overhead. If it's inline, then you have suddenly expanded the number of instructions in your program by a large margin (if you don't, then you aren't using sin() enough to make a difference anyway).

Thank you for the review. But I think your assumption is flawed: what if SIN() is needed in exactly one inner loop, which is run through very, very often? That is the golden opportunity to inline code (no matter how complex it is) and most certainly beat FSIN. Apart from that case, you'd probably end up deciding not to inline, since that code is going to end up more than ~20 instructions. So we are left with static function call overhead, which is quite tiny.

Quote:
 So you get code bloat, using more cpu cache - and worst yet, what was previously one instruction(which obviously can only fit into at most one cache set) is now many which could span two cache sets. So now you've nearly killed your set-associative L1 instruction cache, slowing down other parts of your program. And so on.

Sure, increased i-cache use is a disadvantage of replacing CISC-style instructions, but this is perhaps much exaggerated. With a 64 KiB icache, assuming 20 instructions @ 3 bytes a pop, we are using about 0.1% of the cache. (Before the 386, which had no icache at all, and especially on the 8088 with its really bad bus, you indeed had to watch out for this.)
But what are you on about with "set-associative", and why do you say "nearly killed"?

Quote:
 When you get down into micro-optimizations like this, the third-order effects matter greatly. We could spend hours discussing the effects of micro-ops fusion, superscalar/reordering instruction units, etc., on any small piece of code.

Hours? um, no. With some experience you can predict stalls due to microarchitecture, and otherwise just use a simulator.

Quote:
 And it will be completely different in 18 months when new processor cores come out.

Yes; such is the coder's burden. Actually how about seeing it as: guaranteed employment? :P
But come on, how is this at all relevant to the question of: "do I optimize or not?"? Either your app is too slow AND you have development time left, or not.

Quote:
 In your style, I would say, "show me someone who thinks counting raw clocks is a measure of performance, and I will show you a computer architect stuck in the '80s"

*grin* This is one strawman I do not want to defend. What makes you think that is the case?
And how is "counting raw clocks" defined?
1) adding up latency of instructions = stupid
2) measuring real-world execution times (exact down to 1 clock) = good
3) predicting performance in simulator (which again gives performance of each instruction in clocks) = good

Quote:
 I apologize to the OP for bringing this thread OT. But I think it is entirely relevant to the discussion of micro-optimizations (which are not all bad, not at all)

Are we *discussing* micro-optimizations, as in "do you like them?"? Or were we going to give a list of known ones to help out others?

##### Share on other sites
how about a resource for high level algorithmic optimisations?

##### Share on other sites
What if that inner loop is unrolled, thereby bloating even more with many copies of the inlined optimization? Seems like it could go either way couldn't it?

##### Share on other sites
Quote:
 Original post by Jan Wassenberg
 Are we *discussing* micro-optimizations, as in "do you like them?"? Or were we going to give a list of known ones to help out others?
Well, I hoped the former, but it seems to be going the other way. Either everyone's decided they are of no use, or nobody knows any.

I thought we'd at least see that one about casting by using knowledge about the floating-point format...

##### Share on other sites
For "A faster Square Root than sqrt()" you could use:

```cpp
float fast_sqrt(float Arg)
{
    __asm {
        fld   DWORD PTR _Arg$[esp-4]
        fsqrt
        ret   0
    }
}
```

Alternatively, you could just use 'sqrtf', which is what I used to have MSVS 2003 generate the above code (with default optimization settings in 'release' mode).

##### Share on other sites
Quote:
 Either everyone's decided they are of no use, or nobody knows any.

Did you see this stuff in post #7?
- square root (integer, exact and inexact float)
- sin/cos
- float -> int (C truncate semantics)
- thread with link to mini-article on that topic [fast memcpy/memset replacements]

Quote:
 For "A faster Square Root than sqrt()" you could use [FSQRT]

Did you see the sqrt link above? Given that the FSQRT instruction takes 27 clocks in default double precision mode, it doesn't seem very fast.

Quote:
 What if that inner loop is unrolled, thereby bloating even more with many copies of the inlined optimization? Seems like it could go either way couldn't it?

If the loop body is large, then unrolling it would be a poor decision. Unrolling is no longer a surefire optimization as in the past, since jump overhead is very small and icache is a factor.

Quote:
 I thought we'd at least see that one about casting by using knowledge about the floating-point format...

You mean the 2**52 * 1.5 trick? See (1) and (2).
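
For reference, a sketch of that trick as I understand it: adding 1.5 * 2^52 to a double forces the mantissa into a fixed alignment, so the rounded integer ends up in the low 32 bits of the bit pattern. It assumes IEEE-754 doubles, the default round-to-nearest FPU mode, and |d| < 2^31; note it rounds to nearest (ties to even) rather than truncating like a C cast:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// "2^52 * 1.5" magic-number double -> int conversion. Adding the magic
// constant aligns the mantissa so the rounded integer result occupies
// the low 32 bits of the double's bit pattern. Assumes IEEE-754,
// round-to-nearest mode, and |d| < 2^31. Rounds to nearest (ties to
// even), unlike a C cast, which truncates.
static std::int32_t double_to_int_round(double d)
{
    d += 6755399441055744.0;                 // 1.5 * 2^52
    std::int64_t bits;
    std::memcpy(&bits, &d, sizeof(bits));    // grab the raw bit pattern
    return static_cast<std::int32_t>(bits);  // low 32 bits = rounded int
}
```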

##### Share on other sites
Quote:
 Original post by Jan Wassenberg
 [...] Did you see the sqrt link above? Given that the FSQRT instruction takes 27 clocks in default double precision mode, it doesn't seem very fast. [...]
Did you use basic integer arithmatic to notice that 27 clocks is less than 2.7 × 10^-8 seconds on an (estimated) average machine, which means you could do it ~37M times a second? That's on a 1.0 gigahertz machine that doesn't do any of the things modern processors do to perform parallel and pseudoparallel computation on a single CPU.

Memory access will be a far worse bottleneck than things like sqrt in most modern programs. If you're calling sqrt so often that it's a problem, you're going through a LOT of data. In that case, you'd do far better to get lower-latency RAM, a multi-processor system (presuming each CPU has independent access to the memory and they're not competing for bandwidth), or to upgrade to a processor with a larger cache.

##### Share on other sites
Quote:
 Did you use basic integer arithmatic to notice that 27 clocks is less than 2.7 × 10^-8 seconds on an (estimated) average machine. That's on a 1.0 gigahertz machine [..]

Ooh, I'm not familiar with "arithmatic".
I understand you're trying to make a point here, but come on...

Quote:
 which means you could do it ~37M times a second?

Looks about right. However, that is absolutely meaningless and the wrong way of thinking to boot. If your program's response time is low enough, further optimization means you can process more data in the same time interval!

Quote:
 Memory access will be a far worse bottleneck than things like sqrt in most modern programs.

Ooh, "modern programs"?
But why should we care? Has it not been stated that sqrt optimizations are presented here *for those who need them*?

Quote:
 If you're calling sqrt so often that it's a problem, you're going through a LOT of data. In that case, you'd do far better to get lower-latency RAM, a multi-processor system (presuming each CPU has independent access to the memory and they're not competing for bandwidth), or upgrade to a processor with a larger cache.

LOT of data: yep.
Are you going to give me and all users a new CPU, just so I can be lazy?

I no longer wish to waste my time here entertaining the silly and off-topic discussion of whether optimization is needed or not. If you aren't writing time-critical software or are satisfied with mediocrity, fine by me.

##### Share on other sites
Quote:
 Original post by Jan WassenbergWhat people don't realize is that threads such as this one PRESUPPOSE the need to make use of any of these techniques. Can't we *for once* talk about this without someone saying "premature optimization = boo"?

I never said that. I said that in some cases it was unlikely that you would find a practical, better solution than hardware functions. At any stage of development. Nowhere in this discussion have I said that it is wrong to optimize in such a way. I simply said that in many cases, such trivial optimizations are difficult to perform, and gave some examples of the major difficulties that a programmer would encounter. I even admitted that you could gain a little.

Quote:
 Sure, increased i-cache use is a disadvantage of replacing CISC-style instructions, but this is perhaps much exaggerated. With a 64 KiB icache, assuming 20 instructions @ 3 bytes a pop, we are using about 0.1% of the cache. (Before the 386, which had no icache at all, and especially on the 8088 with its really bad bus, you indeed had to watch out for this.)
 But what are you on about with "set-associative", and why do you say "nearly killed"?

I don't want to get down into the details, but instruction caches usually store predecoded micro-ops, which take up more space. And no modern CPU uses a fully associative cache - it's too slow - so they use set-associative caches, which have minor tradeoffs in speed and wasted space. Like I said, we could spend hours discussing modern CPU architecture and performance. Computer architecture books are thick for a reason.

Quote:
 I no longer wish to waste my time here entertaining the silly and off-topic discussion of whether optimization is needed or not. If you aren't writing time-critical software or are satisfied with mediocrity, fine by me.

I'm not going to waste more time either. I posted what I thought was a useful comment on trying to optimize trivial operations - and I didn't even say that such optimizations were worthless, only that they were sometimes very difficult to do and unlikely to bring much benefit.

Sorry again to the OP, if I come across a useful trick I will post it here.

##### Share on other sites
If you want fast math, just use AMD's or Intel's math libraries, instead of making your own. They're available for download on their developer web pages.