my concept - the importance of using floating point



A computer game should take advantage of floating-point numbers, because overusing integers may actually overload the CPU's integer units for math processing while the FPU portion could be doing that work instead. Feedback please. What do you guys think? Let me start on why. A small computer game will never really see a performance hit, but all the new mathematics and physics used now is probably expected to go to a special module to take care of the processing. Maybe those modules are not really needed, though, because the processors already have a portion of that ability inside their MMX, SSE, and FPU units.

thanks.

Daddy, I'm cold. Can you build me a campfire?

You are too concerned about trivial things like this. If you are pushing a machine that hard, there are more likely other ways to improve performance. Besides, any sort of gain would be negligible.


you should rethink your whole strategy dog

MMX and SSE registers can be used for integer processing btw, not just floating point. But your primary concern should be using the right data format for the right job, as opposed to doing premature optimisation.

This guy is kind of weird.


A final word of warning: there is so much misinformation and downright lies on internet forums about optimisation techniques that, imho, it is best to assume every forum post is full of misinformation and lies. The only way to know whether something is true is to verify it in a profiler.

Profiled it. You're right, the final method you proposed gave me a 200x performance boost. Thanks!

This is true in as much as it's obviously a perf gain if you can keep all the various execution resources as busy as possible. One of the reasons that Quake, and later, Unreal, were able to do things that no one else was doing in software rendering was that they kept the Pentium's U and V pipe both pretty busy, along with MMX where available.

Conceptually, this is easy, but in practice it is very hard -- it requires a knowledge of instructions and compilers that most people don't have, and with out-of-order execution, wider super-scalar architectures, operation fusion techniques, and different issue/resolve latencies for each instruction, it's basically impossible for a human to accomplish at anything more finely-grained than, say, the major execution units: load/store, integer, FPU (simple and complex), and SIMD.

Yes, highly performant code should be aware of these things, but no, this is no epiphany that you've bestowed upon the world.

After reading RobTheBloke's post I'm a bit confused. I think he is joking, but is Ravyne joking? Is there any truth in what GoofProg.F says? It seems like no one takes him seriously.

After reading Cornstalks' reply I decided to test it, and it comes out quite unsurprisingly that the last version of add is two orders of magnitude slower than the naive method on my computer (release mode, with all optimizations and SSE turned on).


After reading RobTheBloke's post I'm a bit confused. I think he is joking, but is Ravyne joking? Is there any truth in what GoofProg.F says? It seems like no one takes him seriously.

Rob's reply was a joke. Ravyne's is not, however. Your program will run faster if it is able to fully utilize the CPU. But simply using floats like Goof suggests is hardly the answer.

After reading Cornstalks' reply I decided to test it, and it comes out quite unsurprisingly that the last version of add is two orders of magnitude slower than the naive method on my computer (release mode, with all optimizations and SSE turned on).

Hahaha got ya!


After reading RobTheBloke's post I'm a bit confused. I think he is joking, but is Ravyne joking? Is there any truth in what GoofProg.F says? It seems like no one takes him seriously.

There is a (naive) truth to it, precisely as I laid out in my post. Inside modern CPUs there are several execution units which can operate independently of one another -- if you interleave instructions which target each of those units correctly then more of them will be busy more often, yielding *potentially* higher performance -- I say "potentially" because your problem and your algorithmic approach must be amenable to this type of optimization to have any hope of benefiting from it -- in other words, if your algorithm is load/store bound, then there's only so much that re-ordering its integer math can do. However, if the algorithm is rather "balanced" across functional units, then the order of CPU instructions can be of great benefit -- it's common to hide memory latency by pre-fetching data a couple of instructions ahead of where it's needed, for example (though there are tradeoffs to consider even in this simple case, such as register pressure).

Then there are lots of modern processor features which make the benefit unpredictable, if not unnecessary -- instruction re-ordering can already alleviate many such order-dependent bottlenecks, hyperthreading can make sure more units are busy running another thread when possible, and dual-thread/in-order cores (like on Xbox 360 or PS3) can execute an alternate thread when the other stalls awaiting memory.

It's one of those ideas where the theory is simple and sound, but which takes a great deal of expertise and a measure of luck to exercise favorably.

Now, if tomorrow Intel decides that the way forward is to throw away our fancy, super-scalar, out-of-order 4-core Sandy Bridge CPUs, and in their place they're going to give us 32 4GHz Atom processors (er, basically Knight's Ferry), or that we all need to migrate to Itanium, then the effect of instruction ordering becomes more predictable and more important. Otherwise, modern CPUs are pretty darn smart as it is, and will do a decent job all on their own, provided you don't go out of your way to sabotage their efforts.

@RobTheBloke - pure genius


After reading RobTheBloke's post I'm a bit confused. I think he is joking, but is Ravyne joking? Is there any truth in what GoofProg.F says? It seems like no one takes him seriously.

There is a partial, vagueish truth to what GoofProg.F says (in so much as interleaving instructions can keep the CPU busier), but he is entirely wrong in the suggested approach to exploit this optimisation. To start with, let's just state an absolute truth from which you cannot escape:

Optimisations that work for one series of CPU (eg Core i7) don't always translate to improvements on other CPUs (eg Atom) - and sometimes they can actually hurt performance! So it is impossible to optimise code to be perfect on all platforms; the best you can do is make sure it doesn't run BADLY on any architecture.

Most CPUs have separate execution units which can (if used correctly) execute instructions in parallel. A simple CPU (such as the Atom) has two (IIRC), whereas some of the newer CPUs have 5 or more. These execution units are not all the same, however. Some may do integer/floating-point addition, some may do integer/floating-point multiplication, etc. - but they aren't always divided by type (eg some may be float only, some may be an int/float mix). Taking something like the Atom, one execution unit can do mult/div/sqrt, and the other can only do addition/subtraction. This makes sense if you consider this code:

float a[NUM];
float f = 1.0f;
for (int i = 0; i < NUM; i++) {
    f *= a[i];
}

The *= is a floating-point multiply op. The dereferencing of a[i] is an integer addition operation (since a[i] can be re-written as *(a+i)). So what you have is an int and a float op next to each other. In theory the CPU could execute both at the same time - but in this example that won't actually happen. The problem is that there is a dependency between the result of a[i] and the multiplication. However, there is also another integer op happening, and that is i++. There is a good chance that the CPU may be able to compute i++ and the *= within the same CPU cycle. Win!

At this point you can unroll the loop slightly (and break a couple of dependencies whilst we are at it). i.e.

float a[NUM];
float f0 = 1.0f, f1 = 1.0f, f2 = 1.0f, f3 = 1.0f;
for (int i = 0; i < NUM; i += 4) {
    f0 *= a[i];
    f1 *= a[i+1];
    f2 *= a[i+2];
    f3 *= a[i+3];
}
// if NUM is not a multiple of 4, don't forget to handle the remaining elements here!
float f = f0 * f1 * f2 * f3;

At this point you have a healthy mix of different op types (without dependencies between them) that could, in theory, be executed in parallel by the CPU. So... the temptation is to go and write all of your loops like that. Well. Don't. The compiler is there to do that stuff for you, and may decide to insert multiple versions of that code into the exe (e.g. one for Atom, one for Core 2, one for i7). By manually unrolling the loop you may make the code run slower on some platforms (for example, it's possible that a given CPU benefits more from good instruction cache usage, and therefore less code in the loop will be beneficial). If you start optimising code this way and it brings noticeable benefits, bear in mind that you've only seen benefits on YOUR CPU, so you should test on other architectures too (because you may have accidentally made things worse!)

Ultimately it's all a bit of a black art really, led mainly by experience, a good knowledge of the CPU architecture, and a lot of profiler-driven guesstimation.


This is true in as much as it's obviously a perf gain if you can keep all the various execution resources as busy as possible. One of the reasons that Quake, and later, Unreal, were able to do things that no one else was doing in software rendering was that they kept the Pentium's U and V pipe both pretty busy, along with MMX where available.

Conceptually, this is easy, but in practice it is very hard -- it requires a knowledge of instructions and compilers that most people don't have, and with out-of-order execution, wider super-scalar architectures, operation fusion techniques, and different issue/resolve latencies for each instruction, it's basically impossible for a human to accomplish at anything more finely-grained than, say, the major execution units: load/store, integer, FPU (simple and complex), and SIMD.

Yes, highly performant code should be aware of these things, but no, this is no epiphany that you've bestowed upon the world.

Not sure who marked this post negatively - It's spot on....

Some of the optimisation hacks look interesting and completely crazy. They make sense, but I'd have to question whether doing something like this is worth the effort and the maintainability cost. Just imagine the poor sod seeing such code for the first time - I wouldn't blame him for thinking the original author had lost his marbles.


Some of the optimisation hacks look interesting and completely crazy. They make sense, but I'd have to question whether doing something like this is worth the effort and the maintainability cost. Just imagine the poor sod seeing such code for the first time - I wouldn't blame him for thinking the original author had lost his marbles.

It's very rarely worth the effort.

If you took a medium sized coding project and applied loop-unrolling and instruction interleaving everywhere, chances are your compiled code size will have increased noticeably. At this point there is an extremely good chance that the overall performance of your app would have dropped a LOT! The problem is that memory is normally the biggest bottleneck on modern systems. Increased code size leads to: more memory reads; more page faults; worse instruction cache usage and so on and so forth.

This is why it is utterly pointless to worry about optimisation until a profiler has told you that you have a problem! If applied sparingly, in the correct place, these optimisations *can* bring about big performance increases. However, applying them globally is always a recipe for disaster.

There are some things that are almost always worth optimising though....
* Replacing one algorithm with one better suited to your problem space.
* Minimising memory reads via data compression (decompression cost is usually negligible compared to the time saved by not reading from memory)
* Minimising data dependencies (although the compiler usually does a good job, it rarely hurts to give it hints)
* Minimising FileIO times (compression again - fileIO being orders of magnitude slower than memory IO)
* SIMD / Multi-core optimisations (imho, this is not an optimisation, it is "building in efficiency". There are however limits!)

All of the above allow for huge performance increases, without ending up with an unmaintainable codebase.

After that, the law of diminishing-returns applies. For example, interleaving instructions across the codebase may take a few weeks, introduce some bugs along the way, and end up providing only a single extra frame per second for your efforts (on a game already running at 70fps). I don't completely agree with the "Don't optimise early" adage, my personal take is "Build in efficiency, and only optimise when someone can demonstrate you actually have a problem that needs to be fixed! (And that problem is actually affecting the release builds!)"


[quote name='Ravyne' timestamp='1316029428' post='4861708']
This is true in as much as it's obviously a perf gain if you can keep all the various execution resources as busy as possible. One of the reasons that Quake, and later, Unreal, were able to do things that no one else was doing in software rendering was that they kept the Pentium's U and V pipe both pretty busy, along with MMX where available.

Conceptually, this is easy, but in practice it is very hard -- it requires a knowledge of instructions and compilers that most people don't have, and with out-of-order execution, wider super-scalar architectures, operation fusion techniques, and different issue/resolve latencies for each instruction, it's basically impossible for a human to accomplish at anything more finely-grained than, say, the major execution units: load/store, integer, FPU (simple and complex), and SIMD.

Yes, highly performant code should be aware of these things, but no, this is no epiphany that you've bestowed upon the world.

Not sure who marked this post negatively - It's spot on....
[/quote]

Off topic, but yeah, I kinda wondered too. It's actually pretty interesting that I never find myself downvoted on the basis of my argument (mostly, I suppose, because I refrain from making strong statements about things I'm less versed in), but I do seem to anger someone now and again (if I know myself not to be factually incorrect, then I can only assume someone is taking revenge for feeling slighted in some way) -- or maybe someone's finger slipped, in which case the forum software fails for not letting people change their votes.

It's really too bad when that happens, though, because someone will come along 6 months from now and assume a post was down-rated for good reason; so if it was done intentionally, then someone has done some small amount of damage to the community in order to avenge their own bruised ego. My rating is high enough that the effect on me personally is negligible; it's the implicit misinformation that upsets me, and the fact that I took time out of my day to provide a reasoned answer and some nitwit just called it into question with the click of a button. Such is the internet, I suppose.

I assume it was you who helped fix it, so thanks. I've also taken on a Robin Hood-like role where I will help right obvious misuse of the rating system.