my concept - the importance of using floating point

Started by
17 comments, last by Ravyne 12 years, 7 months ago
After reading Cornstalks' reply I decided to test it, and it turns out, quite unsurprisingly, that the last version of add is two orders of magnitude slower than the naive method on my computer (release mode, with all optimizations and SSE turned on).

After reading RobTheBloke's post I'm a bit confused. I think he is joking, but is Ravyne joking? Is there any truth in what GoofProg.F says? It seems like no one takes him seriously.

Rob's reply was a joke. Ravyne's is not, however. Your program will run faster if it is able to fully utilize the CPU. But simply using floats like Goof suggests is hardly the answer.


After reading Cornstalks' reply I decided to test it, and it turns out, quite unsurprisingly, that the last version of add is two orders of magnitude slower than the naive method on my computer (release mode, with all optimizations and SSE turned on).

Hahaha got ya! :)
[ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]

After reading RobTheBloke's post I'm a bit confused. I think he is joking, but is Ravyne joking? Is there any truth in what GoofProg.F says? It seems like no one takes him seriously.


There is a (naive) truth to it, precisely as I laid out in my post. Inside modern CPUs there are several execution units which can operate independently of one another -- if you interleave instructions which target each of those units correctly, then more of them will be busy more often, yielding *potentially* higher performance. I say "potentially" because your problem and your algorithmic approach must be amenable to this type of optimization to have any hope of benefiting from it -- in other words, if your algorithm is load/store bound, then there's only so much that re-ordering its integer math can do. However, if the algorithm is fairly "balanced" across functional units, then the order of CPU instructions can be of great benefit -- it's common to hide memory latency by pre-fetching data a couple of instructions ahead of where it's needed, for example (though there are tradeoffs to consider even in this simple case, such as register pressure).
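The prefetching aside can be sketched in C. This is my illustration, not Ravyne's code: the function name and the look-ahead distance of 8 are made up for the example, and `__builtin_prefetch` is a GCC/Clang extension, so it's guarded to compile elsewhere.

```c
#include <stddef.h>

/* Illustrative sketch: ask the cache hardware to start loading data a
 * few elements ahead of where it is needed. The hint only affects
 * timing, never results, and the distance (8 here) is a guess that
 * would need tuning per CPU -- exactly the tradeoff mentioned above. */
static float sum_with_prefetch(const float *data, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__)
        if (i + 8 < n)
            __builtin_prefetch(&data[i + 8]); /* GCC/Clang builtin; a hint only */
#endif
        total += data[i];
    }
    return total;
}
```

Whether this helps at all depends on the access pattern; for a simple linear scan like this, the hardware prefetcher usually does the job on its own.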

Then there are lots of modern processor features which make the benefit unpredictable, if not unnecessary -- instruction re-ordering can already alleviate many such order-dependent bottlenecks, hyperthreading can make sure more units are busy running another thread when possible, and dual-thread/in-order cores (like on Xbox 360 or PS3) can execute an alternate thread when the other stalls awaiting memory.

It's one of those ideas where the theory is simple and sound, but which takes a great deal of expertise and a measure of luck to exercise favorably.

Now, if tomorrow Intel decides that the way forward is to throw away our fancy, super-scalar, out-of-order 4-core Sandy Bridge CPUs, and in their place they're going to give us 32 4GHz Atom processors (er, basically Knight's Ferry), or that we all need to migrate to Itanium, then the effect of instruction ordering becomes more predictable and more important. Otherwise, modern CPUs are pretty darn smart as it is, and will do a decent job all on their own, provided you don't go out of your way to sabotage their efforts.

throw table_exception("(? ???)? ? ???");

@RobTheBloke - pure genius :lol:

After reading RobTheBloke's post I'm a bit confused. I think he is joking, but is Ravyne joking? Is there any truth in what GoofProg.F says? It seems like no one takes him seriously.


There is a partial, vague-ish truth to what GoofProg.F says (in so much as interleaving instructions can keep the CPU busier), but he is entirely wrong in the suggested approach to exploit this optimisation. To start with, let's just state an absolute truth from which you cannot escape:

Optimisations that work for one series of CPU (eg Core i7) don't always translate to improvements on other CPUs (eg Atom) -- and sometimes they can actually hurt performance! So, it is impossible to optimise code to be perfect on all platforms; the best you can do is make sure it doesn't run BADLY on any architecture.

Most CPUs have separate execution units which can (if used correctly) execute instructions in parallel. A simple CPU (such as the Atom) has two (IIRC), whereas some newer CPUs have 5 or more. These execution units are not all the same, however. Some may do integer/floating point addition, some may do integer/floating point multiplication, etc -- but they aren't always divided by type (eg some may be float only, some may be an int/float mix). Taking something like the Atom, one execution unit can do mult/div/sqrt, and the other can only do addition/subtraction. This makes sense if you consider this code:


float a[NUM];
float f = 1.0f;

for (int i = 0; i < NUM; i++)
{
    f *= a[i];
}


The *= is a floating point mult op. The dereferencing of 'a[i]' is an integer addition operation (since a[i] can be re-written as *(a+i)). So what you have is an int op and a float op next to each other. In theory the CPU could execute both at the same time -- but in this example that won't actually happen. The problem is that there is a dependency between the result of a[i] and the multiplication. However..... there is also another integer op happening, and that is i++. There is a good chance that the CPU may be able to compute i++ and the *= within the same CPU cycle. Win!

At this point you can unroll the loop slightly (and break a couple of dependencies whilst we are at it). i.e.


float a[NUM];
float f0 = 1.0f, f1 = 1.0f, f2 = 1.0f, f3 = 1.0f;

for (int i = 0; i < NUM; i += 4)
{
    f0 *= a[i  ];
    f1 *= a[i+1];
    f2 *= a[i+2];
    f3 *= a[i+3];
}

// if NUM is not a multiple of 4, don't forget to handle the remaining elements here!

float f = f0 * f1 * f2 * f3;


At this point you have a healthy mix of different op types (without dependencies between them) that could, in theory, be executed in parallel by the CPU. So... the temptation is to go and write all of your loops like that. Well. Don't. The compiler is there to do that stuff for you, and may decide to insert multiple versions of that code into the exe (e.g. one for Atom, one for Core 2, one for i7). By manually unrolling the loop you may make the code run slower on some platforms (for example, it's possible that a given CPU benefits more from good instruction cache usage, and therefore less code in the loop is beneficial). If you start optimising code this way and it brings noticeable benefits, bear in mind you've only seen benefits on YOUR CPU, so you should test on other architectures too (because you may have accidentally made things worse!)

Ultimately it's all a bit of a black art really, led mainly by experience, a good knowledge of the CPU architecture, and a lot of profiler-driven guesstimation.....

This is true in as much as it's obviously a perf gain if you can keep all the various execution resources as busy as possible. One of the reasons that Quake, and later Unreal, were able to do things that no one else was doing in software rendering was that they kept the Pentium's U and V pipes both pretty busy, along with MMX where available.

Conceptually, this is easy, but in practice it is very hard -- it requires a knowledge of instructions and compilers that most people don't have, and with out-of-order execution, wider super-scalar architectures, operation fusion techniques, and different issue/resolve latencies for each instruction, it's basically impossible for a human to accomplish at anything more finely-grained than, say, the major execution units: load/store, integer, FPU (simple and complex), and SIMD.

Yes, highly performant code should be aware of these things, but no, this is no epiphany that you've bestowed upon the world.


Not sure who marked this post negatively - It's spot on....
Some of the optimisation hacks look interesting and completely crazy. They make sense, but I'd have to question whether doing something like this is worth the effort and the maintainability cost. Just imagine the poor sod seeing such code for the first time -- I wouldn't blame him for thinking the original author had lost his marbles.
Latest project: Sideways Racing on the iPad

Some of the optimisation hacks look interesting and completely crazy. They make sense, but I'd have to question whether doing something like this is worth the effort and the maintainability cost. Just imagine the poor sod seeing such code for the first time -- I wouldn't blame him for thinking the original author had lost his marbles.


It's very rarely worth the effort.

If you took a medium-sized coding project and applied loop unrolling and instruction interleaving everywhere, chances are your compiled code size would increase noticeably. At that point there is an extremely good chance that the overall performance of your app would drop a LOT! The problem is that memory is normally the biggest bottleneck on modern systems. Increased code size leads to more memory reads, more page faults, worse instruction cache usage, and so on and so forth.

This is why it is utterly pointless to worry about optimisation until a profiler has told you that you have a problem! If applied sparingly, in the correct place, these optimisations *can* bring about big performance increases. However, applying them globally is always a recipe for disaster.

There are some things that are almost always worth optimising though....
* Replacing one algorithm with one better suited to your problem space.
* Minimising memory reads via better data structures (eg Oct/Quad/Kd Trees)
* Minimising memory reads via data compression (decompression cost is usually negligible compared to the time saved by not reading from memory)
* Minimising data dependencies (although the compiler usually does a good job, it rarely hurts to give it hints)
* Minimising FileIO times (compression again - fileIO being orders of magnitude slower than memory IO)
* SIMD / Multi-core optimisations (imho, this is not an optimisation, it is "building in efficiency". There are however limits!)
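On the "minimising data dependencies" bullet, one common way to hint the compiler is C99's `restrict` qualifier. This sketch is my example, not the poster's; `scale_add` is a made-up name:

```c
#include <stddef.h>

/* `restrict` promises the compiler that dst and src never overlap, so
 * it is free to reorder and vectorise the loads and stores instead of
 * conservatively re-reading memory after every write. */
static void scale_add(float *restrict dst, const float *restrict src,
                      float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}
```

The promise is the programmer's responsibility: passing overlapping arrays to a `restrict`-qualified function is undefined behaviour.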

All of the above allow for huge performance increases, without ending up with an unmaintainable codebase.
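To make the SIMD bullet concrete, here is a hedged sketch of the product loop from earlier in the thread, rewritten with SSE intrinsics to multiply four floats per instruction. It assumes an x86 target with SSE and that n is a multiple of 4; a real version would handle the tail elements separately, and the function name is made up:

```c
#include <xmmintrin.h> /* SSE intrinsics -- assumes an x86 target */

/* Four partial products are kept in one 128-bit register, mirroring
 * the f0..f3 accumulators of the manually unrolled loop above. */
static float product_sse(const float *a, int n)
{
    __m128 acc = _mm_set1_ps(1.0f);            /* {1, 1, 1, 1} */
    for (int i = 0; i < n; i += 4)
        acc = _mm_mul_ps(acc, _mm_loadu_ps(&a[i]));

    float lanes[4];
    _mm_storeu_ps(lanes, acc);                 /* combine the four lanes */
    return lanes[0] * lanes[1] * lanes[2] * lanes[3];
}
```

Note that this changes the order of the multiplications, which is fine for a product but matters for floating point sums -- one more reason such rewrites belong only where a profiler says they pay off.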

After that, the law of diminishing returns applies. For example, interleaving instructions across the codebase may take a few weeks, introduce some bugs along the way, and end up providing only a single extra frame per second for your efforts (on a game already running at 70fps). I don't completely agree with the "don't optimise early" adage; my personal take is "Build in efficiency, and only optimise when someone can demonstrate you actually have a problem that needs to be fixed! (And that problem is actually affecting the release builds!)"

[quote name='Ravyne' timestamp='1316029428' post='4861708']
This is true in as much as it's obviously a perf gain if you can keep all the various execution resources as busy as possible. One of the reasons that Quake, and later Unreal, were able to do things that no one else was doing in software rendering was that they kept the Pentium's U and V pipes both pretty busy, along with MMX where available.

Conceptually, this is easy, but in practice it is very hard -- it requires a knowledge of instructions and compilers that most people don't have, and with out-of-order execution, wider super-scalar architectures, operation fusion techniques, and different issue/resolve latencies for each instruction, it's basically impossible for a human to accomplish at anything more finely-grained than, say, the major execution units: load/store, integer, FPU (simple and complex), and SIMD.

Yes, highly performant code should be aware of these things, but no, this is no epiphany that you've bestowed upon the world.


Not sure who marked this post negatively - It's spot on....
[/quote]

Off topic, but yeah, I kinda wondered too. It's actually pretty interesting that I never find myself downvoted on the basis of my argument (mostly, I suppose, because I refrain from making strong statements about things I'm less versed in), but I do seem to anger someone now and again (if I know myself not to be factually incorrect, then I can only assume someone is taking revenge for feeling slighted in some way) -- or maybe someone's finger slipped, in which case the forum software fails for not letting people change their votes.

It's really too bad when that happens, though, because someone will come along 6 months from now and assume the post was down-rated for good reason; so if it was done intentionally, then someone has done some small amount of damage to the community in order to avenge their own bruised ego. My rating is high enough that the effect on me personally is negligible; it's the implicit misinformation that upsets me, and the fact that I took time out of my day to provide a reasoned answer and some nitwit called it into question with the click of a button. Such is the internet, I suppose.


I assume it was you who helped fix it, so thanks. I've also taken on a Robin Hood-like role where I will help right obvious misuse of the rating system.

throw table_exception("(? ???)? ? ???");

This topic is closed to new replies.
