d000hg

Let's build a resource for optimisations?


There's a whole bunch of standard questions about optimising various things - sin()/cos(), sqrt(), float->int conversion etc. - which keep coming up here. So I thought people could contribute to this thread, and I'll keep updating this first post with all the 'approved' techniques. I'm thinking C/C++; obviously some things are compiler-specific, which is OK - just be clear about it - and please also explain the accuracy of fast but inexact methods. Here's a quick list of some techniques I KNOW there are answers for - you supply/discuss them and I'll compile/format/organise them:

A faster Square Root than sqrt()

Exact

awaiting submission...

Non-exact

awaiting submission...

Faster than sin()/cos() etc

awaiting submission...

Fast float->int conversion

Truncated

awaiting submission...

Rounded

awaiting submission...

Fast double->int conversion

Truncated

awaiting submission...

Rounded

awaiting submission...

Unfortunately, a lot of the simple mathematical functions are either implemented in hardware (meaning it is very unlikely you will find a faster software method, e.g. sin/cos) or have trivial code replacements that the compiler can already make (e.g. bit-shifting instead of multiplying/dividing by powers of two).

In a practical sense, one way to make code that uses such operations faster is to avoid doing them at all. Rearranging/rewriting code so that the result is mathematically equivalent but fewer operations are needed is the best way to go. (For example, referring to a recent discussion on matrix inverses in the Math & Physics board, there are many alternative ways to compute an inverse that all mathematically produce the same result but have wildly different performance characteristics.)
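A trivial sketch of 'avoiding the operation entirely' (the function is hypothetical, just for illustration): distance comparisons don't need a square root at all, because squaring preserves the ordering of non-negative values.

// instead of: sqrtf(dx*dx + dy*dy) < range
// compare squared values: a < b  <=>  a*a < b*b  for non-negative a, b
bool closer_than(float dx, float dy, float range)
{
    return dx*dx + dy*dy < range*range;
}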

The hardware-implemented things are one area where there are gains to be made - I've seen a few threads talking about how the MSVC++ compiler turns sin(x) into a whole load of stuff surrounding the FSIN instruction.

That's one type of trick I'm thinking we can have here - inline ASM tricks to reduce overhead etc.
Another main one is algorithmic stuff - like an inexact sqrt replacement which can beat the hard-wired FPU instruction, or a fast cube-root trick.
And then there's stuff like SIMD techniques for fast memcpy/memset replacements and so on.
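To seed the 'non-exact' square root category, here is a sketch of the widely circulated approximate reciprocal square root bit trick (assumes 32-bit IEEE-754 floats; worst-case relative error is roughly 0.2% after the single Newton-Raphson step, and sqrt(x) follows as x * rsqrt_fast(x)):

// approximate 1/sqrt(x) via integer bit manipulation plus one
// Newton-Raphson step. assumes 'float' is 32-bit IEEE-754.
float rsqrt_fast(float x)
{
    float xhalf = 0.5f * x;
    int i = *(int*)&x;               // reinterpret the float's bits as an int
    i = 0x5f3759df - (i >> 1);       // magic constant gives the initial guess
    x = *(float*)&i;
    x = x * (1.5f - xhalf * x * x);  // one Newton-Raphson refinement
    return x;
}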

Or you could use the optimized versions already in the libraries. You know, they've seen the most action over the years, and compilers already know how to deal with them with extreme efficiency.

If you *really* want extreme math speeds, go with your architecture's specific optimized library (Intel (x86 and Itanium), AMD (x86-64), and IBM (PPC) all have them; I'm sure you can find a Freescale (PPC) one if you look). These companies pay people to squeeze the most performance out of their code for cluster computers, so that's really going to be the best way to go.

Keep in mind that most optimizations are *not* about finding a faster sqrt (and if that were easy to do, the standard libraries would already use it). Usually, the best way to beat the performance of the standard libraries is to compromise something else: if you can live with lower accuracy, for example, you *might* gain a bit of performance.
But the most powerful optimizations are usually more about restructuring your code in a more compiler-friendly way than about replacing math operations with faster versions.
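A sketch of what 'compiler-friendly' can mean here (function names are just for illustration): accumulating through a pointer forces the compiler to assume the output may alias the input, so it must write the total back on every iteration, whereas a local variable can live in a register.

// aliasing-hostile: *sum may overlap a[], so the compiler must
// store (and often reload) it on every iteration
void sum_slow(const float* a, int n, float* sum)
{
    for (int i = 0; i < n; i++)
        *sum += a[i];
}

// compiler-friendly: accumulate in a local, store once at the end
void sum_fast(const float* a, int n, float* sum)
{
    float s = *sum;  // same semantics as above
    for (int i = 0; i < n; i++)
        s += a[i];
    *sum = s;
}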

Yes, but the point of this thread is to group the 'low-level' optimisations, not to discuss whether they should be used.

Has anyone actually got any to contribute? I'm sure questions about this stuff used to be quite frequent...

Good idea!

Quote:
Or you could use the optimized versions already in the libraries. You know, they've seen the most action over the years, and compilers already know how to deal with them with extreme efficiency.

Unfortunately this doesn't turn out to be true in practice, due to a combination of 1) implementor laziness, 2) the need to work on all CPUs/platforms, and 3) no knowledge of app-specific situations like relaxed precision requirements.

Quote:
Yes, but the point of this thread is to group the 'low-level' optimisations, not to discuss whether they should be used.
Has anyone actually got any to contribute? I'm sure questions about this stuff used to be quite frequent...

Talking about it is easier than standing and delivering :))


square root (integer, exact and inexact float): see Paul Hsieh's very thorough coverage.
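for reference, a minimal sketch of the exact integer case (the classic one-bit-per-iteration method; returns floor(sqrt(x)) for 32-bit unsigned x, no divides or floating point):

unsigned int isqrt(unsigned int x)
{
    unsigned int root = 0;
    unsigned int bit = 1u << 30;  // highest power of 4 in 32 bits

    while (bit > x)               // find the highest power of 4 <= x
        bit >>= 2;

    while (bit != 0)
    {
        if (x >= root + bit)      // can this bit be set in the root?
        {
            x -= root + bit;
            root = (root >> 1) + bit;
        }
        else
            root >>= 1;
        bit >>= 2;
    }
    return root;                  // floor(sqrt(original x))
}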

sin/cos:
1) if you need both at the same time, use the FSINCOS instruction (hard to beat) - a sketch follows after point 2
2)
Quote:
meaning it is very unlikely you will find a faster software method, i.e. sin/cos
Show me someone who cannot implement a decent Taylor-series approximation that runs in less than the 200 clocks needed by the FPU, and I'll show you an incompetent coder :)
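to illustrate 1), a sketch of FSINCOS in MSVC-style inline asm (the wrapper function is hypothetical, just for illustration):

// compute sin and cos of 'angle' with a single FSINCOS
void sincos_fpu(float angle, float* s, float* c)
{
    float s_, c_;
    __asm {
        fld   angle    ; ST(0) = angle
        fsincos        ; ST(0) = cos(angle), ST(1) = sin(angle)
        fstp  c_       ; store and pop the cosine
        fstp  s_       ; store and pop the sine
    }
    *s = s_;
    *c = c_;
}

and for 2), a sketch of such a polynomial (assumes the caller has already range-reduced x to [-pi/2, pi/2]; absolute error is about 1.6e-4 at the interval ends and far smaller near 0 - a minimax polynomial would beat the Taylor coefficients at the same cost):

// sin(x) from the Taylor terms through x^7, evaluated Horner-style.
// only valid for x in [-pi/2, pi/2]; range-reduce first.
float sin_approx(float x)
{
    const float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f/6 + x2 * (1.0f/120 + x2 * (-1.0f/5040))));
}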

float -> int (C truncate semantics):

#if USE_IA32_FLOAT_TO_INT // implies HAVE_MS_ASM

// notes:
// - i32/i64 and UNUSED2() are project typedefs/macros (32/64-bit
//   integers; suppress unused-parameter warnings).
// - PTR is necessary because __declspec(naked) means the assembler
//   cannot refer to the parameter's declared type to get the operand
//   size right.
// - to conform with the fallback implementation (a C cast), we need to
//   end up with truncate/"chop" rounding. subtracting the bias does the
//   trick, assuming RC is the IA-32 default round-to-nearest mode.
// - caveat: the bias trick only matches C truncation for non-negative
//   inputs; negative values can come out one too low.

static const float round_bias = 0.4999999f;

__declspec(naked) i32 ia32_i32_from_float(float f)
{
    UNUSED2(f);
    __asm {
        push  eax                ; reserve a stack slot for the result
        fld   DWORD PTR [esp+8]  ; load the argument
        fsub  [round_bias]
        fistp DWORD PTR [esp]    ; round-to-nearest of (f - bias)
        pop   eax                ; return value goes in eax
        ret
    }
}

__declspec(naked) i32 ia32_i32_from_double(double d)
{
    UNUSED2(d);
    __asm {
        push  eax
        fld   QWORD PTR [esp+8]
        fsub  [round_bias]
        fistp DWORD PTR [esp]
        pop   eax
        ret
    }
}

__declspec(naked) i64 ia32_i64_from_double(double d)
{
    UNUSED2(d);
    __asm {
        push  edx                 ; 64-bit result is returned in edx:eax
        push  eax
        fld   QWORD PTR [esp+12]
        fsub  [round_bias]
        fistp QWORD PTR [esp]
        pop   eax
        pop   edx
        ret
    }
}

#endif // USE_IA32_FLOAT_TO_INT
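(for what it's worth: on CPUs with SSE/SSE2, the bias trick is unnecessary, because the conversion instructions truncate directly. a sketch using the intrinsics - function names are mine:)

#include <xmmintrin.h>  // SSE
#include <emmintrin.h>  // SSE2

static inline i32 sse_i32_from_float(float f)
{
    return _mm_cvtt_ss2si(_mm_set_ss(f));   // CVTTSS2SI: truncates toward zero
}

static inline i32 sse2_i32_from_double(double d)
{
    return _mm_cvttsd_si32(_mm_set_sd(d));  // CVTTSD2SI: truncates toward zero
}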


Quote:
And then there's stuff like SIMD techniques for fast memcpy/memset replacements and so on.

thread with link to mini-article on that topic.
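to give the flavour, a minimal sketch of an SSE2 copy with non-temporal (streaming) stores - it assumes both pointers are 16-byte aligned and size is a multiple of 64; a real replacement must handle unaligned heads/tails and should only stream for blocks too large to fit in cache:

#include <emmintrin.h>  // SSE2 intrinsics
#include <stddef.h>     // size_t

void memcpy_sse2_stream(void* dst, const void* src, size_t size)
{
    const __m128i* s = (const __m128i*)src;
    __m128i* d = (__m128i*)dst;
    for (size_t i = 0; i < size / 64; i++)  // 64 bytes per iteration
    {
        const __m128i x0 = _mm_load_si128(s + 0);
        const __m128i x1 = _mm_load_si128(s + 1);
        const __m128i x2 = _mm_load_si128(s + 2);
        const __m128i x3 = _mm_load_si128(s + 3);
        _mm_stream_si128(d + 0, x0);  // non-temporal: bypasses the cache
        _mm_stream_si128(d + 1, x1);
        _mm_stream_si128(d + 2, x2);
        _mm_stream_si128(d + 3, x3);
        s += 4;
        d += 4;
    }
    _mm_sfence();  // order the streaming stores before any later stores
}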

--
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40
E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3

Quote:
Original post by Jan Wassenberg
sin/cos:
1) if you need both at the same time, use the FSINCOS instruction (hard to beat)
2)
Quote:
meaning it is very unlikely you will find a faster software method, i.e. sin/cos
Show me someone who cannot implement a decent Taylor-series approximation that runs in less than the 200 clocks needed by the FPU, and I'll show you an incompetent coder :)


Well, let me look into that example a bit. You would sacrifice accuracy first, in favor of the simpler, faster instructions of the Taylor series. Certainly, you might be able to get it so that the best-case number of clock cycles for each instruction sums to less than the best-case clocks for the sin function. But those are raw numbers and not really representative of what is truly going on inside the processor.

For example, one big problem is the implementation itself. If you implement it as a function, then you have function call overhead. If it's inline, then you have suddenly expanded the number of instructions in your program by a large margin (if you don't, then you aren't using sin() enough to make a difference anyway). So you get code bloat, using more CPU cache - and worse yet, what was previously one instruction (which obviously can only fit into at most one cache set) is now many, which could span two cache sets. So now you've nearly killed your set-associative L1 instruction cache, slowing down other parts of your program. And so on.

In the end, you lose accuracy and gain little (if anything). When you get down into micro-optimizations like this, the third-order effects matter greatly. We could spend hours discussing the effects of micro-op fusion, superscalar/reordering instruction units, etc., on any small piece of code. And it will be completely different in 18 months when new processor cores come out. In your style, I would say, "show me someone who thinks counting raw clocks is a measure of performance, and I will show you a computer architect stuck in the '80s".

I apologize to the OP for bringing this thread OT. But I think it is entirely relevant to the discussion of micro-optimizations (which are not all bad, not at all)

What people don't realize is that threads such as this one PRESUPPOSE the need for these techniques. Can't we *for once* talk about this without someone saying "premature optimization = boo"?

Quote:
Well, let me look into that example a bit. You would sacrifice accuracy first, in favor of the simpler, faster instructions of the Taylor series. Certainly, you might be able to get it so that the best-case number of clock cycles for each instruction sums to less than the best-case clocks for the sin function. But those are raw numbers and not really representative of what is truly going on inside the processor.

Sure. While cycle count is only a guide, of course you need to measure the whole thing in context to determine gains.

Quote:
For example, one big problem is the implementation itself. If you implement it as a function, then you have function call overhead. If it's inline, then you have suddenly expanded the number of instructions in your program by a large margin (if you don't, then you aren't using sin() enough to make a difference anyway).

Thank you for the review. But I think your assumption is flawed: what if sin() is needed in exactly one inner loop, which is run through very, very often? That is the golden opportunity to inline the code (no matter how complex it is) and most certainly beat FSIN. Apart from that case, you'd probably end up deciding not to inline, since the code is going to end up at more than ~20 instructions. So we are left with static function call overhead, which is quite tiny.

Quote:
So you get code bloat, using more CPU cache - and worse yet, what was previously one instruction (which obviously can only fit into at most one cache set) is now many, which could span two cache sets. So now you've nearly killed your set-associative L1 instruction cache, slowing down other parts of your program. And so on.

Sure, increased i-cache use is a disadvantage of replacing CISC-style instructions, but it is perhaps much exaggerated. With a 64 KiB icache, assuming 20 instructions at 3 bytes a pop, we are using about 0.1% of the cache. (Before the 386, with no icache at all, and especially on the 8088 with its really slow bus, you indeed had to watch out for this.)
But what are you on about with "set-associative", and why do you say "nearly killed"?

Quote:
When you get down into micro-optimizations like this, the third-order effects matter greatly. We could spend hours discussing the effects of micro-ops fusion, superscalar/reordering instruction units, etc., on any small piece of code.

Hours? um, no. With some experience you can predict stalls due to microarchitecture, and otherwise just use a simulator.

Quote:
And it will be completely different in 18 months when new processor cores come out.

Yes; such is the coder's burden. Actually, how about seeing it as guaranteed employment? :P
But come on, how is this at all relevant to the question "do I optimize or not?"? Either your app is too slow AND you have development time left, or not.

Quote:
In your style, I would say, "show me someone who thinks counting raw clocks is a measure of performance, and I will show you a computer architect stuck in the '80s"

*grin* This is one strawman I do not want to defend. What makes you think that is the case?
And how is "counting raw clocks" defined?
1) adding up latency of instructions = stupid
2) measuring real-world execution times (exact down to 1 clock) = good
3) predicting performance in simulator (which again gives performance of each instruction in clocks) = good

Quote:
I apologize to the OP for bringing this thread OT. But I think it is entirely relevant to the discussion of micro-optimizations (which are not all bad, not at all)

Are we *discussing* micro-optimizations, as in "do you like them?"? Or were we going to give a list of known ones to help out others?
