Jump to content

  • Log In with Google      Sign In   
  • Create Account

We're offering banner ads on our site from just $5!

1. Details HERE. 2. GDNet+ Subscriptions HERE. 3. Ad upload HERE.


my concept - the importance of using floating point


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

  • You cannot reply to this topic
18 replies to this topic

#1 GoofProg.F   Members   -  Reputation: 92

Like
0Likes
Like

Posted 14 September 2011 - 09:56 AM

A computer game should take advantage of using floating point numbers because the over use of integer numbers may actually overload the CPU for math processing with the FPU potion can be used for math processing. Feedback please. What do you guys think? Let me start on why. A small computer game will never really see a performance hit but all the new mathemathics and physics used now probably is expected to go to a special module to take care of the processing, but maybe those modules are not really needed because the processors already have a portion of that ability inside of their MMX,SSE, and FPU processing.

thanks.
General Studies A.S - College of Southern Nevada 2003 GPA 2.3

Sponsor:

#2 Hodgman   Moderators   -  Reputation: 31210

Like
0Likes
Like

Posted 14 September 2011 - 10:06 AM

Daddy, I'm cold. Can you build me a campfire?

#3 Moe   Crossbones+   -  Reputation: 1248

Like
0Likes
Like

Posted 14 September 2011 - 10:27 AM

You are too concerned about trivial things like this. If you are pushing a machine that hard, there are more likely other ways to improve performance. Besides, any sort of gain would be negligible.

#4 kunos   Crossbones+   -  Reputation: 2207

Like
0Likes
Like

Posted 14 September 2011 - 10:30 AM

Feedback please.



you should rethink your whole strategy dog



Stefano Casillo
Lead Programmer
TWITTER: @KunosStefano
AssettoCorsa - netKar PRO - Kunos Simulazioni

#5 Tachikoma   Members   -  Reputation: 552

Like
0Likes
Like

Posted 14 September 2011 - 10:48 AM

MMX and SSE registers can be used for integer processing btw, not just floating points. But your primary concern should be about using the right data format for the right job, as opposed to doing premature optimisations.
Latest project: Sideways Racing on the iPad

#6 FLeBlanc   Crossbones+   -  Reputation: 3109

Like
0Likes
Like

Posted 14 September 2011 - 10:58 AM

This guy is kind of weird.

#7 RobTheBloke   Crossbones+   -  Reputation: 2349

Like
9Likes
Like

Posted 14 September 2011 - 12:10 PM

A computer game should take advantage of using floating point numbers because the over use of integer numbers may actually overload the CPU for math processing with the FPU potion can be used for math processing. Feedback please.


Absolutely correct, however you have forgotten about mixing in the other variable types too. What you should be aiming for, is a healthy mix of bits, bytes, shorts, ints, int64's, floats, doubles and strings (don't forget SSE4 has string methods now!). So the best way to extract the maximum performance is to ensure you always mix different types in your calculations as much as possible. So let's start with a very basic example:

struct ivec2
{
  int x;
  int y;
};

ivec2 add(const ivec2& a, const ivec2& b)
{
  ivec2 c;
  c.x = a.x + b.x;  c.y = a.y + b.y;
  return c;
}


Now, a seasoned professional developer will notice immediately that you are overloading the integer portion of the cpu in the add method. The best way to reduce this overhead is to insert some double precision operations in there (the trig functions in math.h are particularly useful for this - doing maths on constant numbers will just get removed by compiler optimisations, so make sure the compiler won't remove any of the code and use variables). Anyhow, here is how a seasoned pro would go about optimising that method. First lets try to balance the heavy integer load on the CPU, with some nice cheap double ops:


ivec2 add(const ivec2& a, const ivec2& b)
{
  ivec2 c;
  sin( (double)a.x );
  c.x = a.x + b.x;
  cos( (double)a.y );
  c.y = a.y + b.y;
  acos( (double)b.x );
  return c;

}


However, since the standard library trig functions are only using the double portions of the CPU, the CPU usage is still very unbalanced. We can simply add in some calls to floating point trig methods here which will help out somewhat! i.e.



ivec2 add(const ivec2& a, const ivec2& b)
{
  ivec2 c;
  sin( (double)a.x );
  sinf( (float)a.y );
  c.x = a.x + b.x; 
  cos( (double)b.x );
  sinf( (float)b.y );
  c.y = a.y + b.y;
  acos( (double)a.x );
  acosf( (float)a.y );
  return c;
}


Obviously, we still aren't using any boolean ops at this point, so a really handy trick is to combine the calls in a meaningless comparison. That way the boolean portion of the CPU becomes load balanced too!

ivec2 add(const ivec2& a, const ivec2& b)
{
  short hack=0; //< make use of shorts too! Just here so nothing gets optimised away.... 
  ivec2 c;
  if( sin( (double)a.x ) < sinf( (float)a.y ) )  //< resolves 
 	hack = 1;
  else //< always use an else statement. It makes sure the branch predictor is being load balanced as well as the double portions of the CPU. 
 	hack = 3;
  c.x = a.x + b.x;
  if( cos( (double)b.x ) < cosf( (float)b.y ) )
 	hack = 2;
  else
 	hack = 5;
  c.y = a.y + b.y;
  if( acos( (double)a.x ) < acosf( (float)a.y ) )
 	hack = 3;
  else
 	hack = 6;
  return c;
}


As a final pass, we can add in some really cheap string ops for the benefit of those people with SSE4.1 capable CPUs. i.e.

ivec2 add(const ivec2& a, const ivec2& b)
{
  std::string tempString; //< using std::string everywhere is always a massive performance win. Try it and see!
  short hack=0;
  ivec2 c;
  tempString = "This will ";
  if( sin( (double)a.x ) < sinf( (float)a.y ) )
 	hack = 1;
  else
 	hack = 4;
  c.x = a.x + b.x;  tempString += " make things ";
  if( cos( (double)b.x ) < cosf( (float)b.y ) )
 	hack = 2;
  else
 	hack = 5;
  c.y = a.y + b.y;
  tempString += " really really fast! You'll be amazed at the net gain when you run this through a profiler!";
  if( acos( (double)a.x ) < acosf( (float)a.y ) )
 	hack = 3;
  else
 	hack = 6;
  return c;

}


I work in middleware, and I can assure you that we optimise every single method in exactly the same way. The biggest problem with this kind of optimisation though, is that it requires a super experienced programmer to make it work properly. An inexperienced programmer may naively attempt the above optimisations, but end up with something significantly slower than they started with. Normally therefore, it's recommended that you always leave optimisation work for a compiler to do.

A final word of warning: There is so much mis-information and downright lies in internet forums about optimisation techniques, that imho it is best to assume that every forum post is full of mis-information and lies. The only way to know if something is true or not is to verify it in a profiler.....

#8 Cornstalks   Crossbones+   -  Reputation: 6991

Like
0Likes
Like

Posted 14 September 2011 - 01:18 PM

A final word of warning: There is so much mis-information and downright lies in internet forums about optimisation techniques, that imho it is best to assume that every forum post is full of mis-information and lies. The only way to know if something is true or not is to verify it in a profiler.....

Profiled it. You're right, the final method you proposed gave me a 200x performance boost. Thanks!
[ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]

#9 Ravyne   GDNet+   -  Reputation: 7880

Like
2Likes
Like

Posted 14 September 2011 - 01:43 PM

This is true in as much as its obviously a perf-gain if you can keep all the various execution resources as busy as possible. One of the reasons that Quake, and later, Unreal, were able to do things that no one else was doing in software rendering was that they kept the Pentium's U and V pipe both pretty busy, along with MMX where available.

Conceptually, this is easy, but in practice it is very hard -- it requires a knowledge of instructions and compilers that most people don't have, and with out-of-order execution, wider super-scalar architectures, operation fusion techniques, and different issue/resolve latencies for each instruction, its basically impossible for a human to accomplish at anything more finely-grained than, say, the major execution units: load/store, integer, FPU (simple and complex), and SIMD.

Yes, highly performant code should be aware of these things, but no, this is no epiphany that you've bestowed upon the world.

#10 Wooh   Members   -  Reputation: 638

Like
0Likes
Like

Posted 14 September 2011 - 02:07 PM

After reading RobTheBloke's post I'm a bit confused. I think he is joking but is Ravyne joking? Is there any truth in what GoofProg.F says? It seems like no one takes him serious.

#11 apatriarca   Crossbones+   -  Reputation: 1748

Like
1Likes
Like

Posted 14 September 2011 - 03:05 PM

After reading Cornstalks reply I have decided to test it and it comes out quite unsurprisingly that the last version of add is two order of magnitude slower than the naive method on my computer (release mode and all optimizations and SSE turned on).

#12 Cornstalks   Crossbones+   -  Reputation: 6991

Like
0Likes
Like

Posted 14 September 2011 - 03:51 PM

After reading RobTheBloke's post I'm a bit confused. I think he is joking but is Ravyne joking? Is there any truth in what GoofProg.F says? It seems like no one takes him serious.

Rob's reply was a joke. Ravyne's is not, however. Your program will run faster if it is able to fully utilize the CPU. But simply using floats like Goof suggests is hardly the answer.

After reading Cornstalks reply I have decided to test it and it comes out quite unsurprisingly that the last version of add is two order of magnitude slower than the naive method on my computer (release mode and all optimizations and SSE turned on).

Hahaha got ya! :)
[ I was ninja'd 71 times before I stopped counting a long time ago ] [ f.k.a. MikeTacular ] [ My Blog ] [ SWFer: Gaplessly looped MP3s in your Flash games ]

#13 Ravyne   GDNet+   -  Reputation: 7880

Like
0Likes
Like

Posted 14 September 2011 - 03:58 PM

After reading RobTheBloke's post I'm a bit confused. I think he is joking but is Ravyne joking? Is there any truth in what GoofProg.F says? It seems like no one takes him serious.


There is a (naive) truth to it, precisely as I laid out in my post. Inside modern CPUs there are several execution units which can operate independently of one another -- If you interleave instructions which target each of those units correctly then more of them will be busy more often, yielding *potentially* higher performance -- I say "potentially" because your problem and your algorithmic approach must be amenable to this type of optimization to have any hope of benefiting from it -- in other words, if your algorithm is Load/Store bound, then there's only so much that re-ordering its integer math can do. However, if the algorithm is rather "balanced" across functional units, then the order of CPU instructions can be of great benefit -- its common to hide memory latency by pre-fetching data a couple instructions ahead of where its needed, for example (though there are tradeoffs to consider even in this simple case, such as register pressure).

Then there's lots of modern processor features which make the benefit unpredictable, if not unnecessary -- Instruction re-ordering can already alleviate many such order-dependent bottlenecks, hyperthreading can make sure more units are busy running another thread when possible, or dual-thread/in-order cores (like on Xbox 360 or PS3) can execute an alternate thread when the other stalls awaiting memory.

Its one of those ideas where the theory is simple and sound, but which takes a great deal of expertise and a measure of luck to exercise favorably.

Now, if tomorrow Intel decides that the way forward is to throw away our fancy, super-scaler, out-of-order 4-core Sandy Bridge CPUs, and in its place they're going to give us 32 4Ghz Atom processors (er, basically Knight's Ferry) or that we all need to migrate to Itanium, then the effect of instruction ordering becomes more predictable and more important. Otherwise, modern CPUs are pretty darn smart as it is, and will do a decent job all on their own, provided you don't go out of your way to sabotage their efforts.

#14 GregMichael   Members   -  Reputation: 143

Like
0Likes
Like

Posted 14 September 2011 - 04:07 PM

@RobTheBloke - pure genius :lol:

#15 RobTheBloke   Crossbones+   -  Reputation: 2349

Like
2Likes
Like

Posted 15 September 2011 - 09:07 AM

After reading RobTheBloke's post I'm a bit confused. I think he is joking but is Ravyne joking? Is there any truth in what GoofProg.F says? It seems like no one takes him serious.


There is a partial, vagueish truth to what GoofProg.F says (in so much as interleaving instructions can keep the CPU busier), but he is entirely wrong in the suggested approach to exploit this optimisation. To start with, let's just state an absolute truth from which you cannot escape:

Optimisations that work for one series of CPU (eg core i7) don't always translate to improvements on other CPUs (eg Atom) - and some times they can actually hurt performance! So, it is impossible to optimise code to be perfect on all platforms, the best you can do is make sure it doesn't run BADLY on all architectures

Most CPU's have seperate execution units which can (if used correctly) execute instructions in parallel (something that require Out-Of-Order-Execution). A simple CPU (such as the Atom) has two (IIRC), whereas some of the newer CPU's have 5 or more. These execution units are not the same however. So some may do integer/floating point addition, some may do integer/floating point multiplication, etc - but they aren't always done by type (eg some may be float only, some may be an int/float mix). Taking something like the Atom, one execution unit can do mult/div/sqrt, and the other can only do addition/subtraction. This makes sense if you consider this code:

float a[NUM];
float f = 1.0;

for(int i; i<NUM; i++)
{
  f *= a[i];
}

The *= is a floating point mult op. The dereferencing of 'a' is an integer addition operation (since a can be re-written as *(a+i)). So what you have is an int & a float op next to each other. In theory the CPU could execute both at the same time - but in this example that won't actually happen. The problem is that there is a dependency between the result of a[i] and the multiplication. However..... there is also another integer op happening, and that is i++. There is a good chance that the CPU may be able to compute i++ and the *= within the same CPU cycle. Win!

At this point you can unroll the loop slightly (and break a couple of dependencies whilst we are at it). i.e.

float a[NUM];
float f0=1.0f,f1=1.0f,f2=1.0f,f3=1.0f;

for(int i; i<NUM; i+=4)
{
  f0 *= a[i  ];
  f1 *= a[i+1];
  f2 *= a[i+2];
  f3 *= a[i+3];
}

// if NUM is not a mulitple of 4, don't forget to handle remaining elements here!

float f = f0 * f1 * f2 * f3;

At this point you have a healthy mix of different op types (without dependencies between them) that could, in theory, be executed in parallel by the CPU. So... The temptation is to go and write all of your loops like that. Well. Don't. The compiler is there to do that stuff for you, and may decide to insert multiple versions of that code into the exe (e.g. one for Atom, one for core2, one for i7). By manually unrolling the loop you may make the code run slower on some platforms (for example it's possible that a given CPU benefits more from good instruction cache usage, and therefore less code in the loop will be beneficial). If you start optimising code this way and it brings noticeable benefits, bare in mind you've only seen benefits on YOUR CPU, so you should test on other architectures too (because you may have made accidentally things worse!)

Ultimately it's all a bit of a black art really, led mainly by experience, a good knowlegde of the CPU architecture, and a lot of profiler driven guesstimation.....

#16 RobTheBloke   Crossbones+   -  Reputation: 2349

Like
0Likes
Like

Posted 15 September 2011 - 09:09 AM

This is true in as much as its obviously a perf-gain if you can keep all the various execution resources as busy as possible. One of the reasons that Quake, and later, Unreal, were able to do things that no one else was doing in software rendering was that they kept the Pentium's U and V pipe both pretty busy, along with MMX where available.

Conceptually, this is easy, but in practice it is very hard -- it requires a knowledge of instructions and compilers that most people don't have, and with out-of-order execution, wider super-scalar architectures, operation fusion techniques, and different issue/resolve latencies for each instruction, its basically impossible for a human to accomplish at anything more finely-grained than, say, the major execution units: load/store, integer, FPU (simple and complex), and SIMD.

Yes, highly performant code should be aware of these things, but no, this is no epiphany that you've bestowed upon the world.


Not sure who marked this post negatively - It's spot on....

#17 Tachikoma   Members   -  Reputation: 552

Like
0Likes
Like

Posted 15 September 2011 - 09:49 AM

Some of the optimisation hacks look interesting and completely crazy. They make sense, but I'd have to question whether doing something like this worth the effort and the maintainability. Just imagine the poor sod seeing such code for the first time - I wouldn't blame him for thinking the original author lost his marbles.
Latest project: Sideways Racing on the iPad

#18 RobTheBloke   Crossbones+   -  Reputation: 2349

Like
0Likes
Like

Posted 15 September 2011 - 11:38 AM

Some of the optimisation hacks look interesting and completely crazy. They make sense, but I'd have to question whether doing something like this worth the effort and the maintainability. Just imagine the poor sod seeing such code for the first time - I wouldn't blame him for thinking the original author lost his marbles.


It's very rarely worth the effort.

If you took a medium sized coding project and applied loop-unrolling and instruction interleaving everywhere, chances are your compiled code size will have increased noticeably. At this point there is an extremely good chance that the overall performance of your app would have dropped a LOT! The problem is that memory is normally the biggest bottleneck on modern systems. Increased code size leads to: more memory reads; more page faults; worse instruction cache usage and so on and so forth.

This is why it is utterly pointless to worry about optimisation until a profiler has told you that you have a problem! If appplied sparingly, in the correct place, these optimisations *can* bring about a big performance increases. However, applying them globally is always a recipe for disaster.

There are some things that are almost always worth optimising though....
* Replacing one algorithm with one better suited to your problem space.
* Minimising memory reads via better data structures (eg Oct/Quad/Kd Trees)
* Minimising memory reads via data compression (decompression cost is usually negligable compared to the time saved by not reading from memory)
* Minimising data dependencies (although the compiler usually does a good job, it rarely hurts to give it hints)
* Minimising FileIO times (compression again - fileIO being orders of magnitude slower than memory IO)
* SIMD / Multi-core optimisations (imho, this is not an optimisation, it is "building in efficiency". There are however limits!)

All of the above allow for huge performance increases, without ending up with an unmaintainable codebase.

After that, the law of diminishing-returns applies. For example, interleaving instructions across the codebase may take a few weeks, introduce some bugs along the way, and end up providing only a single extra frame per second for your efforts (on a game already running at 70fps). I don't completely agree with the "Don't optimise early" adage, my personal take is "Build in efficiency, and only optimise when someone can demonstrate you actually have a problem that needs to be fixed! (And that problem is actually affecting the release builds!)"

#19 Ravyne   GDNet+   -  Reputation: 7880

Like
0Likes
Like

Posted 15 September 2011 - 05:17 PM


This is true in as much as its obviously a perf-gain if you can keep all the various execution resources as busy as possible. One of the reasons that Quake, and later, Unreal, were able to do things that no one else was doing in software rendering was that they kept the Pentium's U and V pipe both pretty busy, along with MMX where available.

Conceptually, this is easy, but in practice it is very hard -- it requires a knowledge of instructions and compilers that most people don't have, and with out-of-order execution, wider super-scalar architectures, operation fusion techniques, and different issue/resolve latencies for each instruction, its basically impossible for a human to accomplish at anything more finely-grained than, say, the major execution units: load/store, integer, FPU (simple and complex), and SIMD.

Yes, highly performant code should be aware of these things, but no, this is no epiphany that you've bestowed upon the world.


Not sure who marked this post negatively - It's spot on....


Off topic, but yeah I kinda wondered too. Its actually pretty interesting that I never find myself downvoted on the basis of my argument (mostly I suppose because I refrain from making strong statements about things I'm less versed in), but I do seem to anger someone now and again (if I know myself to not be factually incorrect, then I can only assume someone is taking revenge for feeling slighted in some way) -- or maybe someone's finger slipped, in which case the forum software fails for not letting people change their votes.

Its really too bad when that happens though, because someone will come along 6 months from now and assume a post was down-rated for good reason, and so if it was done intentionally, then someone has done some small amount of damage to the community in order to avenge their own bruised ego. My rating is high enough that the effect on me personally is negligable, its the implicit misinformation that upsets me and the fact that I took time out of my day to provide a reasoned answer, and some nitwit just called it into question with the click of a button. Such is the internet, I suppose.


I assume it was you who helped fix it, so thanks. I've also taken on a Robin Hood-like role where I will help right obvious misuse of the rating system.




Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.



PARTNERS