Let's build a resource for optimisations?
There's a whole bunch of standard questions about optimising various things - sin()/cos(), sqrt(), float->int conversion etc. which keep coming up here.
So I thought maybe people could contribute to this thread - I'll keep updating it and adding all the 'approved' techniques to this first post.
I'm thinking C/C++, obviously some things are compiler specific which is OK, just be clear - and also explain the accuracy of fast but inexact methods.
Here's a quick list of some techniques I KNOW there are answers for - you supply/discuss them and I'll compile/format/organise them:
Unfortunately, a lot of the simple mathematical functions are either implemented in hardware (meaning it is very unlikely you will find a faster software method, i.e. sin/cos) or have trivial code replacements that can already be done by the compiler (i.e., bitshifting vs. multiply/divide)
In a practical sense, one way to make code that uses such operations faster is to avoid doing them at all. Rearranging/rewriting code so that the result is mathematically equivilant, but there are fewer operations, is the best way to go. (For example, referring to a recent discussion on matrix inverse in the math&physics board, there are many alternative ways to go about it that all mathematically result in the inverse, but have wildly different performance characteristics)
In a practical sense, one way to make code that uses such operations faster is to avoid doing them at all. Rearranging/rewriting code so that the result is mathematically equivilant, but there are fewer operations, is the best way to go. (For example, referring to a recent discussion on matrix inverse in the math&physics board, there are many alternative ways to go about it that all mathematically result in the inverse, but have wildly different performance characteristics)
the hardware implemented things are one area where there's gains to be made - I've seen a few threads talking about how the MSVC++ compiler turns sin(x) into a whole load of stuff surrounding the SIN assembly operation.
That's one type of trick I'm thinking we can have here - inline ASM tricks to reduce overhead etc.
Another main one is algorithmic stuff - like an inexact sqrt replacement which can beat the FPU hard-wired instruction, or a fast cube-root trick.
And then there's stuff like SIMD techniques for fast memcpy/memset replacements and so on.
That's one type of trick I'm thinking we can have here - inline ASM tricks to reduce overhead etc.
Another main one is algorithmic stuff - like an inexact sqrt replacement which can beat the FPU hard-wired instruction, or a fast cube-root trick.
And then there's stuff like SIMD techniques for fast memcpy/memset replacements and so on.
Or you could use the optimized versions already in the libraries. You know, they've seen the most action over the years, compilers already know how to deal with them with extreme effiecency.
If you *really* want extreme math speeds, go with your archetecure's specific optimized library (Intel (x86 and Itanium), AMD (x86-64), IBM (PPC) all have them, I'm sure you can find a Freescale (PPC) one if you look). These companies pay people to squeeze the most performance out of their code for cluster computers, so that's really going to be the best way to go.
If you *really* want extreme math speeds, go with your archetecure's specific optimized library (Intel (x86 and Itanium), AMD (x86-64), IBM (PPC) all have them, I'm sure you can find a Freescale (PPC) one if you look). These companies pay people to squeeze the most performance out of their code for cluster computers, so that's really going to be the best way to go.
Keep in mind that most optimizations are *not* about finding a faster sqrt (and if that was easy to do, the standard libraries would use that one already. Usually, the best way to beat the performance of standard libraries is to compromise something else. If you can live with lower accuracy you *might* gain a bit of performance, for example).
But the most powerful optimizations are usually more about restructuring your code in a more compiler-friendly way than about replacing math operations with faster versions.
But the most powerful optimizations are usually more about restructuring your code in a more compiler-friendly way than about replacing math operations with faster versions.
Yes but the point of this thread is to group the 'low-level' optimisations, not discuss whether they should be used.
Has anyone actually got any to contribute - I'm sure questions about this stuff used to be quite frequent....
Has anyone actually got any to contribute - I'm sure questions about this stuff used to be quite frequent....
Good idea!
Unfortunately this doesn't turn out true in practice, due to combination of 1) implementor laziness 2) needing to work on all CPUs/platforms 3) no knowledge of app-specific situations like relaxed precision requirements.
Talking about it is easier than standing and delivering :))
square root (integer, exact and inexact float): see Paul Hsieh's very thorough coverage.
sin/cos:
1) if you need both at the same time, use FSINCOS instruction (hard to beat)
2)
float -> int (C truncate semantics):
thread with link to mini-article on that topic.
--
Quote:Or you could use the optimized versions already in the libraries. You know, they've seen the most action over the years, compilers already know how to deal with them with extreme effiecency.
Unfortunately this doesn't turn out true in practice, due to combination of 1) implementor laziness 2) needing to work on all CPUs/platforms 3) no knowledge of app-specific situations like relaxed precision requirements.
Quote:Yes but the point of this thread is to group the 'low-level' optimisations, not discuss whether they should be used.
Has anyone actually got any to contribute - I'm sure questions about this stuff used to be quite frequent....
Talking about it is easier than standing and delivering :))
square root (integer, exact and inexact float): see Paul Hsieh's very thorough coverage.
sin/cos:
1) if you need both at the same time, use FSINCOS instruction (hard to beat)
2)
Quote:meaning it is very unlikely you will find a faster software method, i.e. sin/cosShow me someone who cannot implement a decent Taylor series approximation that runs in less than the 200 clocks needed by FPU, and I'll show you an incompetent coder :)
float -> int (C truncate semantics):
#if USE_IA32_FLOAT_TO_INT // implies HAVE_MS_ASM// notes:// - PTR is necessary because __declspec(naked) means the assembler// cannot refer to parameter argument type to get it right.// - to conform with the fallback implementation (a C cast), we need to// end up with truncate/"chop" rounding. subtracting does the trick,// assuming RC is the IA-32 default round-to-nearest mode.static const float round_bias = 0.4999999f;__declspec(naked) i32 ia32_i32_from_float(float f){ UNUSED2(f);__asm{ push eax fld DWORD PTR [esp+8] fsub [round_bias] fistp DWORD PTR [esp] pop eax ret}}__declspec(naked) i32 ia32_i32_from_double(double d){ UNUSED2(d);__asm{ push eax fld QWORD PTR [esp+8] fsub [round_bias] fistp DWORD PTR [esp] pop eax ret}}__declspec(naked) i64 ia32_i64_from_double(double d){ UNUSED2(d);__asm{ push edx push eax fld QWORD PTR [esp+12] fsub [round_bias] fistp QWORD PTR [esp] pop eax pop edx ret}}#endif // USE_IA32_FLOAT_TO_INT
Quote:And then there's stuff like SIMD techniques for fast memcpy/memset replacements and so on.
thread with link to mini-article on that topic.
--
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
Quote:Original post by Jan Wassenberg
sin/cos:
1) if you need both at the same time, use FSINCOS instruction (hard to beat)
2)Quote:meaning it is very unlikely you will find a faster software method, i.e. sin/cosShow me someone who cannot implement a decent Taylor series approximation that runs in less than the 200 clocks needed by FPU, and I'll show you an incompetent coder :)
Well, let me look into that example a bit. You would sacrifice accuracy first, in favor of the simpler, faster instructions of the taylor series. Certainly, you might be able to get it so that the best-case number of clock cycles for each instruction sums to less than the best-case clocks for the sin function. But those are raw numbers and not really representative of what is truly going on inside the processor.
For example, one big problem is considering implementation. If you implement it as a function, then you have function call overhead. If it's inline, then you have suddenly expanded the number of instructions in your program by a large margin (if you don't, then you aren't using sin() enough to make a difference anyway). So you get code bloat, using more cpu cache - and worst yet, what was previously one instruction(which obviously can only fit into at most one cache set) is now many which could span two cache sets. So now you've nearly killed your set-associative L1 instruction cache, slowing down other parts of your program. And so on.
In the end, you lose accuracy and gain little (if anything). When you get down into micro-optimizations like this, the third-order effects matter greatly. We could spend hours discussing the effects of micro-ops fusion, superscalar/reordering instruction units, etc., on any small piece of code. And it will be completely different in 18 months when new processor cores come out. In your style, I would say, "show me someone who thinks counting raw clocks is a measure of performance, and I will show you a computer architect stuck in the '80s"
I apologize to the OP for bringing this thread OT. But I think it is entirely relevant to the discussion of micro-optimizations (which are not all bad, not at all)
What people don't realize is that threads such as this one PRESUPPOSE the need to make use of any of these techniques. Can't we *for once* talk about this without someone saying "premature optimization = boo"?
Sure. While cycle count is only a guide, of course you need to measure the whole thing in context to determine gains.
Thank you for the review. But I think your assumption is flawed: what if SIN() is needed in exactly one inner loop, which is run through very, very often? That is the golden opportunity to inline code (no matter how complex it is) and most certainly beat FSIN. Apart from that case, you'd probably end up deciding not to inline, since that code is going to end up more than ~20 instructions. So we are left with static function call overhead, which is quite tiny.
Sure, increased i-cache use is a disadvantage of replacing CISC-style instructions, but this is perhaps much exaggerated. With 64KiB icache, assuming 20 instructions @ 3bytes a pop, we are using 0.1% of cache. (before 386 with no icache at all and especially on 8088 with really bad bus, you indeed had to watch out for this)
But what are you on about with "set-associative", and why do you say "nearly killed"?
Hours? um, no. With some experience you can predict stalls due to microarchitecture, and otherwise just use a simulator.
Yes; such is the coder's burden. Actually how about seeing it as: guaranteed employment? :P
But come on, how is this at all relevant to the question of: "do I optimize or not?"? Either your app is too slow AND you have development time left, or not.
*grin* This is one strawman I do not want to defend. What makes you think that is the case?
And how is "counting raw clocks" defined?
1) adding up latency of instructions = stupid
2) measuring real-world execution times (exact down to 1 clock) = good
3) predicting performance in simulator (which again gives performance of each instruction in clocks) = good
Are we *discussing* micro-optimizations, as in "do you like them?"? Or were we going to give a list of known ones to help out others?
Quote:Well, let me look into that example a bit. You would sacrifice accuracy first, in favor of the simpler, faster instructions of the taylor series. Certainly, you might be able to get it so that the best-case number of clock cycles for each instruction sums to less than the best-case clocks for the sin function. But those are raw numbers and not really representative of what is truly going on inside the processor.
Sure. While cycle count is only a guide, of course you need to measure the whole thing in context to determine gains.
Quote:For example, one big problem is considering implementation. If you implement it as a function, then you have function call overhead. If it's inline, then you have suddenly expanded the number of instructions in your program by a large margin (if you don't, then you aren't using sin() enough to make a difference anyway).
Thank you for the review. But I think your assumption is flawed: what if SIN() is needed in exactly one inner loop, which is run through very, very often? That is the golden opportunity to inline code (no matter how complex it is) and most certainly beat FSIN. Apart from that case, you'd probably end up deciding not to inline, since that code is going to end up more than ~20 instructions. So we are left with static function call overhead, which is quite tiny.
Quote:So you get code bloat, using more cpu cache - and worst yet, what was previously one instruction(which obviously can only fit into at most one cache set) is now many which could span two cache sets. So now you've nearly killed your set-associative L1 instruction cache, slowing down other parts of your program. And so on.
Sure, increased i-cache use is a disadvantage of replacing CISC-style instructions, but this is perhaps much exaggerated. With 64KiB icache, assuming 20 instructions @ 3bytes a pop, we are using 0.1% of cache. (before 386 with no icache at all and especially on 8088 with really bad bus, you indeed had to watch out for this)
But what are you on about with "set-associative", and why do you say "nearly killed"?
Quote:When you get down into micro-optimizations like this, the third-order effects matter greatly. We could spend hours discussing the effects of micro-ops fusion, superscalar/reordering instruction units, etc., on any small piece of code.
Hours? um, no. With some experience you can predict stalls due to microarchitecture, and otherwise just use a simulator.
Quote:And it will be completely different in 18 months when new processor cores come out.
Yes; such is the coder's burden. Actually how about seeing it as: guaranteed employment? :P
But come on, how is this at all relevant to the question of: "do I optimize or not?"? Either your app is too slow AND you have development time left, or not.
Quote:In your style, I would say, "show me someone who thinks counting raw clocks is a measure of performance, and I will show you a computer architect stuck in the '80s"
*grin* This is one strawman I do not want to defend. What makes you think that is the case?
And how is "counting raw clocks" defined?
1) adding up latency of instructions = stupid
2) measuring real-world execution times (exact down to 1 clock) = good
3) predicting performance in simulator (which again gives performance of each instruction in clocks) = good
Quote:I apologize to the OP for bringing this thread OT. But I think it is entirely relevant to the discussion of micro-optimizations (which are not all bad, not at all)
Are we *discussing* micro-optimizations, as in "do you like them?"? Or were we going to give a list of known ones to help out others?
This topic is closed to new replies.
Advertisement
Popular Topics
Advertisement