Let's build a resource for optimisations?

Started by
16 comments, last by Fruny 18 years, 3 months ago
There's a whole bunch of standard questions about optimising various things - sin()/cos(), sqrt(), float->int conversion etc. which keep coming up here. So I thought maybe people could contribute to this thread - I'll keep updating it and adding all the 'approved' techniques to this first post. I'm thinking C/C++, obviously some things are compiler specific which is OK, just be clear - and also explain the accuracy of fast but inexact methods. Here's a quick list of some techniques I KNOW there are answers for - you supply/discuss them and I'll compile/format/organise them:

A faster Square Root than sqrt()

Exact

awaiting submission...

Non-exact

awaiting submission...

Faster than sin()/cos() etc

awaiting submission...

Fast float->int conversion

Truncated

awaiting submission...

Rounded

awaiting submission...

Fast double->int conversion

Truncated

awaiting submission...

Rounded

awaiting submission...
Advertisement
Unfortunately, a lot of the simple mathematical functions are either implemented in hardware (meaning it is very unlikely you will find a faster software method, i.e. sin/cos) or have trivial code replacements that can already be done by the compiler (i.e., bitshifting vs. multiply/divide)

In a practical sense, one way to make code that uses such operations faster is to avoid doing them at all. Rearranging/rewriting code so that the result is mathematically equivilant, but there are fewer operations, is the best way to go. (For example, referring to a recent discussion on matrix inverse in the math&physics board, there are many alternative ways to go about it that all mathematically result in the inverse, but have wildly different performance characteristics)
the hardware implemented things are one area where there's gains to be made - I've seen a few threads talking about how the MSVC++ compiler turns sin(x) into a whole load of stuff surrounding the SIN assembly operation.

That's one type of trick I'm thinking we can have here - inline ASM tricks to reduce overhead etc.
Another main one is algorithmic stuff - like an inexact sqrt replacement which can beat the FPU hard-wired instruction, or a fast cube-root trick.
And then there's stuff like SIMD techniques for fast memcpy/memset replacements and so on.
Or you could use the optimized versions already in the libraries. You know, they've seen the most action over the years, compilers already know how to deal with them with extreme effiecency.

If you *really* want extreme math speeds, go with your archetecure's specific optimized library (Intel (x86 and Itanium), AMD (x86-64), IBM (PPC) all have them, I'm sure you can find a Freescale (PPC) one if you look). These companies pay people to squeeze the most performance out of their code for cluster computers, so that's really going to be the best way to go.

Keep in mind that most optimizations are *not* about finding a faster sqrt (and if that was easy to do, the standard libraries would use that one already. Usually, the best way to beat the performance of standard libraries is to compromise something else. If you can live with lower accuracy you *might* gain a bit of performance, for example).
But the most powerful optimizations are usually more about restructuring your code in a more compiler-friendly way than about replacing math operations with faster versions.
Yes but the point of this thread is to group the 'low-level' optimisations, not discuss whether they should be used.

Has anyone actually got any to contribute - I'm sure questions about this stuff used to be quite frequent....
Good idea!

Quote:Or you could use the optimized versions already in the libraries. You know, they've seen the most action over the years, compilers already know how to deal with them with extreme effiecency.

Unfortunately this doesn't turn out true in practice, due to combination of 1) implementor laziness 2) needing to work on all CPUs/platforms 3) no knowledge of app-specific situations like relaxed precision requirements.

Quote:Yes but the point of this thread is to group the 'low-level' optimisations, not discuss whether they should be used.
Has anyone actually got any to contribute - I'm sure questions about this stuff used to be quite frequent....

Talking about it is easier than standing and delivering :))


square root (integer, exact and inexact float): see Paul Hsieh's very thorough coverage.

sin/cos:
1) if you need both at the same time, use FSINCOS instruction (hard to beat)
2)
Quote:meaning it is very unlikely you will find a faster software method, i.e. sin/cos
Show me someone who cannot implement a decent Taylor series approximation that runs in less than the 200 clocks needed by FPU, and I'll show you an incompetent coder :)

float -> int (C truncate semantics):
#if USE_IA32_FLOAT_TO_INT	// implies HAVE_MS_ASM// notes:// - PTR is necessary because __declspec(naked) means the assembler//   cannot refer to parameter argument type to get it right.// - to conform with the fallback implementation (a C cast), we need to//   end up with truncate/"chop" rounding. subtracting does the trick,//   assuming RC is the IA-32 default round-to-nearest mode.static const float round_bias = 0.4999999f;__declspec(naked) i32 ia32_i32_from_float(float f){	UNUSED2(f);__asm{	push		eax	fld			DWORD PTR [esp+8]	fsub		[round_bias]	fistp		DWORD PTR [esp]	pop			eax	ret}}__declspec(naked) i32 ia32_i32_from_double(double d){	UNUSED2(d);__asm{	push		eax	fld			QWORD PTR [esp+8]	fsub		[round_bias]	fistp		DWORD PTR [esp]	pop			eax	ret}}__declspec(naked) i64 ia32_i64_from_double(double d){	UNUSED2(d);__asm{	push		edx	push		eax	fld			QWORD PTR [esp+12]	fsub		[round_bias]	fistp		QWORD PTR [esp]	pop			eax	pop			edx	ret}}#endif	// USE_IA32_FLOAT_TO_INT


Quote:And then there's stuff like SIMD techniques for fast memcpy/memset replacements and so on.

thread with link to mini-article on that topic.

--
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
Quote:Original post by Jan Wassenberg
sin/cos:
1) if you need both at the same time, use FSINCOS instruction (hard to beat)
2)
Quote:meaning it is very unlikely you will find a faster software method, i.e. sin/cos
Show me someone who cannot implement a decent Taylor series approximation that runs in less than the 200 clocks needed by FPU, and I'll show you an incompetent coder :)


Well, let me look into that example a bit. You would sacrifice accuracy first, in favor of the simpler, faster instructions of the taylor series. Certainly, you might be able to get it so that the best-case number of clock cycles for each instruction sums to less than the best-case clocks for the sin function. But those are raw numbers and not really representative of what is truly going on inside the processor.

For example, one big problem is considering implementation. If you implement it as a function, then you have function call overhead. If it's inline, then you have suddenly expanded the number of instructions in your program by a large margin (if you don't, then you aren't using sin() enough to make a difference anyway). So you get code bloat, using more cpu cache - and worst yet, what was previously one instruction(which obviously can only fit into at most one cache set) is now many which could span two cache sets. So now you've nearly killed your set-associative L1 instruction cache, slowing down other parts of your program. And so on.

In the end, you lose accuracy and gain little (if anything). When you get down into micro-optimizations like this, the third-order effects matter greatly. We could spend hours discussing the effects of micro-ops fusion, superscalar/reordering instruction units, etc., on any small piece of code. And it will be completely different in 18 months when new processor cores come out. In your style, I would say, "show me someone who thinks counting raw clocks is a measure of performance, and I will show you a computer architect stuck in the '80s"

I apologize to the OP for bringing this thread OT. But I think it is entirely relevant to the discussion of micro-optimizations (which are not all bad, not at all)
What people don't realize is that threads such as this one PRESUPPOSE the need to make use of any of these techniques. Can't we *for once* talk about this without someone saying "premature optimization = boo"?

Quote:Well, let me look into that example a bit. You would sacrifice accuracy first, in favor of the simpler, faster instructions of the taylor series. Certainly, you might be able to get it so that the best-case number of clock cycles for each instruction sums to less than the best-case clocks for the sin function. But those are raw numbers and not really representative of what is truly going on inside the processor.

Sure. While cycle count is only a guide, of course you need to measure the whole thing in context to determine gains.

Quote:For example, one big problem is considering implementation. If you implement it as a function, then you have function call overhead. If it's inline, then you have suddenly expanded the number of instructions in your program by a large margin (if you don't, then you aren't using sin() enough to make a difference anyway).

Thank you for the review. But I think your assumption is flawed: what if SIN() is needed in exactly one inner loop, which is run through very, very often? That is the golden opportunity to inline code (no matter how complex it is) and most certainly beat FSIN. Apart from that case, you'd probably end up deciding not to inline, since that code is going to end up more than ~20 instructions. So we are left with static function call overhead, which is quite tiny.

Quote:So you get code bloat, using more cpu cache - and worst yet, what was previously one instruction(which obviously can only fit into at most one cache set) is now many which could span two cache sets. So now you've nearly killed your set-associative L1 instruction cache, slowing down other parts of your program. And so on.

Sure, increased i-cache use is a disadvantage of replacing CISC-style instructions, but this is perhaps much exaggerated. With 64KiB icache, assuming 20 instructions @ 3bytes a pop, we are using 0.1% of cache. (before 386 with no icache at all and especially on 8088 with really bad bus, you indeed had to watch out for this)
But what are you on about with "set-associative", and why do you say "nearly killed"?

Quote:When you get down into micro-optimizations like this, the third-order effects matter greatly. We could spend hours discussing the effects of micro-ops fusion, superscalar/reordering instruction units, etc., on any small piece of code.

Hours? um, no. With some experience you can predict stalls due to microarchitecture, and otherwise just use a simulator.

Quote:And it will be completely different in 18 months when new processor cores come out.

Yes; such is the coder's burden. Actually how about seeing it as: guaranteed employment? :P
But come on, how is this at all relevant to the question of: "do I optimize or not?"? Either your app is too slow AND you have development time left, or not.

Quote:In your style, I would say, "show me someone who thinks counting raw clocks is a measure of performance, and I will show you a computer architect stuck in the '80s"

*grin* This is one strawman I do not want to defend. What makes you think that is the case?
And how is "counting raw clocks" defined?
1) adding up latency of instructions = stupid
2) measuring real-world execution times (exact down to 1 clock) = good
3) predicting performance in simulator (which again gives performance of each instruction in clocks) = good

Quote:I apologize to the OP for bringing this thread OT. But I think it is entirely relevant to the discussion of micro-optimizations (which are not all bad, not at all)

Are we *discussing* micro-optimizations, as in "do you like them?"? Or were we going to give a list of known ones to help out others?
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
how about a resource for high level algorithmic optimisations?

This topic is closed to new replies.

Advertisement