
Math API performance: saving CPU cycles?



#1 ATC   Members   -  Reputation: 551


Posted 25 September 2012 - 11:14 AM

I'm trying not to go overboard with micro-optimizations for my engine's math API, but I am trying to put some consideration into performance. For example, it's my understanding that multiplication is slightly faster than division, which saves a few CPU cycles here and there; this can add up when high-frequency code executes over and over in a game loop. So I have done things like this example from my Matrix structure:

[source lang="csharp"]
public static Matrix operator /(Matrix mat, float div)
{
#if PERFORM_CHECKS
    if (div == 0)
        throw new MathematicalException("Divisor is zero.", new DivideByZeroException());
#endif
    float num = 1f / div;
    var result = Matrix.Identity;
    result.M11 = mat.M11 * num; result.M12 = mat.M12 * num; result.M13 = mat.M13 * num; result.M14 = mat.M14 * num;
    result.M21 = mat.M21 * num; result.M22 = mat.M22 * num; result.M23 = mat.M23 * num; result.M24 = mat.M24 * num;
    result.M31 = mat.M31 * num; result.M32 = mat.M32 * num; result.M33 = mat.M33 * num; result.M34 = mat.M34 * num;
    result.M41 = mat.M41 * num; result.M42 = mat.M42 * num; result.M43 = mat.M43 * num; result.M44 = mat.M44 * num;
    return result;
}
[/source]

Is this correct/true, and should I be doing it this way? And what other optimizations might I use in general to make my math code blazing fast and efficient?

Might I even consider doing something like this:

[source lang="csharp"]
#if !PERFORM_CHECKS
unchecked
{
#endif
    // math code here...
#if !PERFORM_CHECKS
}
#endif
[/source]
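A minimal sketch of the kind of micro-benchmark that could settle the division-versus-reciprocal question on a given machine (the class name and loop counts are made up for the example; results will vary by CPU and JIT):

[source lang="csharp"]
using System;
using System.Diagnostics;

static class DivMulBenchmark
{
    const int N = 100000000;

    static void Main()
    {
        RunOnce(N / 100); // warm-up pass so the JIT has compiled both loops
        RunOnce(N);       // the timed run
    }

    static void RunOnce(int n)
    {
        float div = 3.7f;
        float inv = 1f / div;

        float a = 1f;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++) a /= div;
        sw.Stop();
        Console.WriteLine("division:            {0} ms (result {1})", sw.ElapsedMilliseconds, a);

        float b = 1f;
        sw.Restart();
        for (int i = 0; i < n; i++) b *= inv;
        sw.Stop();
        // Printing the results keeps the JIT from discarding the loops as dead code.
        Console.WriteLine("reciprocal multiply: {0} ms (result {1})", sw.ElapsedMilliseconds, b);
    }
}
[/source]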
_______________________________________________________________________________
CEO & Lead Developer at ATCWARE™
"Project X-1"; a 100% managed, platform-agnostic game & simulation engine


Please visit our new forums and help us test them and break the ice!
___________________________________________________________________________________


#2 SimonForsman   Crossbones+   -  Reputation: 5770


Posted 25 September 2012 - 11:21 AM

That looks like something you should use SIMD for tbh.
I don't suffer from insanity, I'm enjoying every minute of it.
The voices in my head may not be real, but they have some good ideas!

#3 ATC   Members   -  Reputation: 551


Posted 25 September 2012 - 11:35 AM

That looks like something you should use SIMD for tbh.


Well, the problem with that is, I think, that SIMD is CPU-specific; is it not? And I'm writing a C# engine. So dealing with that fact would make this difficult...

Furthermore, there are only two ways I know of to execute arbitrary machine code from C#. The first way is the "standard" way, using the interop layer, but that incurs a performance penalty that may defeat the whole purpose. The only other way I know to execute "pure" machine code is a bit complicated and may not speed things up enough to be worth it. But essentially, this is how it's done:

First you create a pool of unmanaged memory and "trick" the CLR into believing it is executable code. One way is to use the Reflection API and treat it as a module. Then you obtain an address to some place in the memory that is safe to write to. You then emit machine code instructions (as raw bytes) and copy them into that memory. Then you use the Marshal class to obtain a delegate wrapping the function pointer. Then you can call the code, and it will indeed work. I did this before just for giggles, and I actually have the project saved somewhere on my old HDD. But doing all this seems like overkill, and I doubt it would be worth it. And I'd have to figure out a way to emit the correct machine code for every single processor architecture this engine could conceivably be used on. As this is a platform-agnostic engine, that is no small undertaking.
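A minimal sketch of that technique on Windows (using VirtualAlloc rather than the Reflection trick described above; the delegate still crosses the managed/native boundary on every call, which is part of why the gains rarely justify the effort):

[source lang="csharp"]
using System;
using System.Runtime.InteropServices;

static class NativeCodeDemo
{
    // Windows-only: allocate memory that is both writable and executable.
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize,
                                      uint flAllocationType, uint flProtect);

    const uint MEM_COMMIT = 0x1000, MEM_RESERVE = 0x2000, PAGE_EXECUTE_READWRITE = 0x40;

    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    delegate int ReturnsInt();

    static void Main()
    {
        // x86/x64 machine code for: mov eax, 42; ret
        byte[] code = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

        IntPtr mem = VirtualAlloc(IntPtr.Zero, (UIntPtr)(uint)code.Length,
                                  MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
        Marshal.Copy(code, 0, mem, code.Length);

        // Wrap the raw function pointer in a callable delegate.
        var fn = (ReturnsInt)Marshal.GetDelegateForFunctionPointer(mem, typeof(ReturnsInt));
        Console.WriteLine(fn()); // prints 42
    }
}
[/source]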

#4 ATC   Members   -  Reputation: 551


Posted 25 September 2012 - 11:47 AM

What about interoperating my engine with Intel's Math Kernel Library (MKL)? Would the performance gain be worth it? And would it be portable not only across computer platforms but to consoles and mobile devices as well?

#5 Álvaro   Crossbones+   -  Reputation: 11865


Posted 25 September 2012 - 12:20 PM

Is this correct/true, and should I be doing it this way?

That's probably correct/true. However, the only way to know for sure is to test it in a program. If you are writing a Math library without a project that uses it, I would say you are doing it wrong. In the same manner as game engines are usually created by extracting and polishing the parts of a game that can be reused in other games, a Math library should be created by extracting and polishing the Math-related code that can be reused in other projects. But if you don't have a project to start, how are you going to test your library and how are you going to know what would be useful or not?

And what other optimizations might I use in general to make my math code blazing fast and efficient?

Don't use C#.

Edited by alvaro, 25 September 2012 - 12:21 PM.


#6 ATC   Members   -  Reputation: 551


Posted 25 September 2012 - 12:47 PM

That's probably correct/true. However, the only way to know for sure is to test it in a program. If you are writing a Math library without a project that uses it, I would say you are doing it wrong. In the same manner as game engines are usually created by extracting and polishing the parts of a game that can be reused in other games, a Math library should be created by extracting and polishing the Math-related code that can be reused in other projects. But if you don't have a project to start, how are you going to test your library and how are you going to know what would be useful or not?


You're right, and I am using a test project. I'm not merely writing a math library, I'm writing an engine, and it's a pretty large project. It's a major pain to go in and perform micro-testing on every little algorithm, so I try to do things right from the start and then go in and actually do all those micro-tests and micro-optimizations every other week; it usually takes a day or a few days dedicated to that and that alone.

Don't use C#.


Lol, c'mon... XD

IIRC, the last time I did a head-to-head "race" of C# vs C math code, the difference was negligible (often at or near a tie) with CLR checks turned off. After all, once the code is run through the JIT it is native code. That's why when I performance-test C# code I run one iteration of the test first and throw away the results, to "pre-JIT" the code, as the first run incurs an overhead subsequent runs will not. I was recently talking about that in another thread. C# is by no means slow or "less powerful". While I'm losing a little speed in some areas of the engine, I'm going to win in the overall picture. The memory efficiency, stability and reduced complexity of engine internals make this thing perform at a rather breathtaking speed.

For example, a few years ago I ran a test of a prototype of this engine which wasn't nearly as good/optimized as this commercial WIP version. I brute-force rendered a terrain made up of several million tris with complex multi-texture blending shaders, normal mapping and lighting... It was running at about 7500 fps despite the scene being so "heavy". :)

Edited by ATC, 25 September 2012 - 12:52 PM.


#7 SimonForsman   Crossbones+   -  Reputation: 5770


Posted 25 September 2012 - 02:01 PM


That looks like something you should use SIMD for tbh.


Well, the problem with that is, I think, that SIMD is CPU-specific; is it not? And I'm writing a C# engine. So dealing with that fact would make this difficult...

Furthermore, there are only two ways I know of to execute arbitrary machine code from C#. The first way is the "standard" way, using the interop layer, but that incurs a performance penalty that may defeat the whole purpose. The only other way I know to execute "pure" machine code is a bit complicated and may not speed things up enough to be worth it. But essentially, this is how it's done:



You could just use Mono.Simd; it works with .NET as well (although .NET users don't get actual SIMD support AFAIK) and should work with most modern x86 CPUs.

Edited by SimonForsman, 25 September 2012 - 02:03 PM.
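A rough sketch of what that looks like, assuming the Vector4f type from Mono.Simd.dll (the method names here are illustrative):

[source lang="csharp"]
using System;
using Mono.Simd; // ships with Mono; works as a plain struct library on .NET

static class SimdExample
{
    // Scale one matrix row by a precomputed reciprocal in a single
    // (potentially SIMD-accelerated) four-wide multiply.
    static Vector4f ScaleRow(Vector4f row, float inv)
    {
        return row * new Vector4f(inv, inv, inv, inv);
    }

    static void Main()
    {
        Vector4f scaled = ScaleRow(new Vector4f(1f, 2f, 3f, 4f), 1f / 2f);
        Console.WriteLine("{0} {1} {2} {3}", scaled.X, scaled.Y, scaled.Z, scaled.W);
    }
}
[/source]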


#8 ATC   Members   -  Reputation: 551


Posted 25 September 2012 - 02:17 PM

You could just use Mono.Simd; it works with .NET as well (although .NET users don't get actual SIMD support AFAIK) and should work with most modern x86 CPUs.


Interesting. I'll look into this.

But does anyone know anything about Intel's MKL? It sounds like it can be pretty darn fast.

#9 swiftcoder   Senior Moderators   -  Reputation: 9585


Posted 25 September 2012 - 03:03 PM

It's a major pain to go in and perform micro-testing on every little algorithm, so I try to do things right from the start and then go in and actually do all those micro-tests and micro-optimizations every other week; it usually takes a day or a few days dedicated to that and that alone.

You need to come up with a better test framework. It shouldn't be necessary to spend days running tests and profiling.

You should absolutely have unit tests that can be run after every build. These are to check the correctness of each of your math functions.

But you probably also want to consider integrating profiling with your build. Flip a switch, and it will compile an executable with profiling built-in. Then you can run real-world tests with your actual application, to ensure that you are hitting your performance goals.
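One way to wire that switch up in C# is the ConditionalAttribute: profiling calls compile to nothing unless the build defines the symbol. A hypothetical sketch (the PROFILE symbol and the Profiler class are made up for the example):

[source lang="csharp"]
using System;
using System.Diagnostics;

// Calls to these methods are stripped at compile time unless the
// build defines PROFILE (e.g. a dedicated "Profiling" configuration).
static class Profiler
{
    static readonly Stopwatch clock = Stopwatch.StartNew();

    [Conditional("PROFILE")]
    public static void Begin(string section)
    {
        Console.WriteLine("[{0} ms] enter {1}", clock.ElapsedMilliseconds, section);
    }

    [Conditional("PROFILE")]
    public static void End(string section)
    {
        Console.WriteLine("[{0} ms] leave {1}", clock.ElapsedMilliseconds, section);
    }
}

// Usage inside engine code; zero cost in non-profiling builds:
//   Profiler.Begin("Physics.Update");
//   physics.Update(dt);
//   Profiler.End("Physics.Update");
[/source]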

Tristam MacDonald - Software Engineer @Amazon - [swiftcoding]


#10 ATC   Members   -  Reputation: 551


Posted 25 September 2012 - 04:20 PM

You need to come up with a better test framework. It shouldn't be necessary to spend days running tests and profiling.

You should absolutely have unit tests that can be run after every build. These are to check the correctness of each of your math functions.

But you probably also want to consider integrating profiling with your build. Flip a switch, and it will compile an executable with profiling built-in. Then you can run real-world tests with your actual application, to ensure that you are hitting your performance goals.


You're absolutely right. I just haven't gotten a chance to write a good test framework yet. Most of the math code is coming from a past prototype which was already tested thoroughly, so I know the output for everything is correct in its current state. However, I DO need to create a new test framework and do it all over again. But I decided to make this thread first to get ideas on how to optimize things before I start changing or rewriting anything. BTW, when I was talking about taking "several days" I wasn't talking about simply testing things; that part is pretty quick. I was talking about all the work of using the test results to do micro-optimizations, rewrite things and make significant changes... then testing again to make sure the changes actually worked, didn't break anything and actually resulted in a net performance gain.

Explain to me how you think I should do my profiling builds in as much detail as you're willing or have time to go into. It's an area I'm definitely no expert in.

BTW, do you know anything about Intel's MKL and how it might be beneficial to my engine project?

#11 swiftcoder   Senior Moderators   -  Reputation: 9585


Posted 25 September 2012 - 09:22 PM

You're absolutely right. I just haven't gotten a chance to write a good test framework yet.

I'm not sure I'd bother writing a full test framework for a mid-size C# project. The standard toolchain has enough support to do the basics.

Code coverage reports are a very useful tool, but keep in mind that good tests are worth more than high coverage.

But I decided to make this thread first to get ideas on how to optimize things before I start changing or rewriting anything.

I'm a huge fan of not performing premature optimisations. I might legitimately be accused of being pedantic on the subject, but in my view you shouldn't start optimising until you have firm profiling data showing that the code in question is at fault. Programmers spend an awful lot of time optimising the wrong code.

Explain to me how you think I should do my profiling builds in as much detail as you're willing or have time to go into. It's an area I'm definitely no expert in.

If you have access to a good external profiler, that is usually your best option. Apple's Instruments (formerly Shark) is unparalleled in this area, but there are workable solutions for most platforms/languages (I haven't used Visual Studio's built-in profiler; it might be good enough).

Failing that, you get to manually instrument your code. Add timing code (and log the results!) for all major subsystems, and add additional sample points as needed to drill down into performance 'hot areas'.
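A common shape for that kind of manual instrumentation in C# is a disposable timing scope; a minimal sketch (the names are illustrative):

[source lang="csharp"]
using System;
using System.Diagnostics;

// Times a block of code and logs the result when the scope is disposed.
sealed class TimedScope : IDisposable
{
    readonly string name;
    readonly Stopwatch watch = Stopwatch.StartNew();

    public TimedScope(string name) { this.name = name; }

    public void Dispose()
    {
        watch.Stop();
        Console.WriteLine("{0}: {1:F3} ms", name, watch.Elapsed.TotalMilliseconds);
    }
}

// Usage:
//   using (new TimedScope("Renderer.DrawFrame"))
//   {
//       renderer.DrawFrame(scene);
//   }
[/source]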

BTW, do you know anything about Intel's MKL and how it might be beneficial to my engine project?

No, but it looks interesting. I've mucked with SIMD/AltiVec intrinsics before, and it's all a bit of a pain between cross-platform concerns and different hardware revisions. MKL might help out with those issues, but I'm not thrilled that it is strongly coupled to the Intel compiler suite on Mac.



#12 clb   Members   -  Reputation: 1777


Posted 26 September 2012 - 03:14 AM

Answering the original question..

For example, it's my understanding that multiplication is slightly faster than division, which saves a few CPU cycles here and there

Is this correct/true, and should I be doing it this way? And what other optimizations might I use in general to make my math code blazing fast and efficient?


Assuming that the SSE instruction set is used instead of the old x87 FPU stack, a single-precision scalar float division (the DIVSS instruction) has a latency of 14-32 cycles and a processing time of 14-32 cycles, depending on the architecture. Double-precision scalar float division (DIVSD) has a latency of 22-39 cycles and a processing time of 20-39 cycles.

Compare to multiplication: a single-precision scalar float multiplication (MULSS) has a latency of 4-7 cycles and a processing time of 1-2 cycles, and double-precision scalar multiplication (MULSD) has a latency of 5-7 cycles and a processing time of 1-2 cycles.

The figures were taken from the Intel Intrinsics Guide.

So, multiplication is about 20 times faster (assuming perfectly pipelined instructions).

I'm ignoring here the fact that you're not using C/C++ with direct SSE assembly/intrinsics but C#; still, the point is that yes, division is considerably slower *for the CPU* to execute than multiplication, even on modern CPUs. Whether that can be seen in the C# execution environment is then a matter of profiling.

MathGeoLib uses this 'multiplication by inverse' form, as do most of the game math libraries I've seen. Note that x / s and x * (1/s) are not arithmetically identical: first computing the inverse as a float and then multiplying by it loses some precision.
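To make the precision point concrete, a small sketch that counts how often the two forms disagree over random inputs (illustrative only; the mismatch rate depends on the value ranges chosen):

[source lang="csharp"]
using System;

static class DivisionPrecision
{
    static void Main()
    {
        var rng = new Random(1234);
        const int trials = 1000000;
        int mismatches = 0;

        for (int i = 0; i < trials; i++)
        {
            float x = (float)(rng.NextDouble() * 1000.0);
            float s = (float)(rng.NextDouble() * 1000.0 + 0.001);

            float direct     = x / s;        // one correctly rounded operation
            float viaInverse = x * (1f / s); // two roundings: inverse, then product

            if (direct != viaInverse) mismatches++;
        }

        Console.WriteLine("{0} of {1} results differ in the last bit(s)", mismatches, trials);
    }
}
[/source]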

And what other optimizations might I use in general to make my math code blazing fast and efficient?


It should be noted that in C/C++, both a single function call and an 'if' statement are far slower than performing a single division. However, again, in the context of C#, I recommend profiling in your real application hotspot to see what kind of effects these have, since that's quite a different context than low-level C code at the assembly/intrinsic level.
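On the function-call side of that equation, .NET 4.5 added an inlining hint that is relevant for tiny, hot math methods; a sketch (the Vector3 type is made up for the example, and whether the JIT honors the hint is up to the runtime):

[source lang="csharp"]
using System.Runtime.CompilerServices;

public struct Vector3
{
    public float X, Y, Z;

    // Ask the JIT to inline this tiny, hot method so the call overhead
    // doesn't dwarf the arithmetic it performs.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector3 Scale(Vector3 v, float s)
    {
        Vector3 r;
        r.X = v.X * s; r.Y = v.Y * s; r.Z = v.Z * s;
        return r;
    }
}
[/source]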
Me+PC=clb.demon.fi | C++ Math and Geometry library: MathGeoLib, test it live! | C++ Game Networking: kNet | 2D Bin Packing: RectangleBinPack | Use gcc/clang/emcc from VS: vs-tool | Resume+Portfolio | gfxapi, test it live!

#13 rdragon1   Crossbones+   -  Reputation: 1173


Posted 26 September 2012 - 01:03 PM

After all, once the code is run through the JIT it is native code.


Just because it "is" native code does not mean it's the same as compiling C++ and running it through the extensive suite of optimization passes that non-JIT compilers have the freedom (time) to perform, not to mention the whole-program optimizations that can be run if you use link-time code generation.

Yes, JITted code is better than an interpreter. No, it's not usually as good as real compiled code. The C#/.NET ecosystem is (intentionally) optimized for productivity.

#14 swiftcoder   Senior Moderators   -  Reputation: 9585


Posted 26 September 2012 - 01:07 PM

No, it's not usually as good as real compiled code.

And sometimes it's considerably better than 'real' compiled code. There are a number of optimisations available to (tracing) JITs that an ahead-of-time compiler just doesn't have enough data to perform.



#15 ATC   Members   -  Reputation: 551


Posted 26 September 2012 - 02:54 PM

And sometimes it's considerably better than 'real' compiled code. There are a number of optimisations available to (tracing) JITs that an ahead-of-time compiler just doesn't have enough data to perform.


True. And you've probably heard me saying that over and over in various conversations! :-)

That's why I decided to write a C# engine instead of a C++ or C engine. Games and engines get so complex that memory leaks and inefficiencies seem to be "business as usual" in the field. Pretty much every commercial game I play has some fatal bugs and wastes memory (e.g., Skyrim CTDs like crazy, ARMA II will consume vast amounts of resources on a pause menu, etc). It's my theory (and I have partially confirmed it through tests) that in the end I can beat most native engines and game systems not only in speed but in efficiency, smaller memory footprint, stability and reduced bloat. Various tests and prototypes over the past few years have continued to confirm my suspicions. I also worked at a company with the same philosophy, and their engine turned out to be a success.

It's not because C# is a "better" language, but because we're human. For instance, we get lazy and do copy+paste jobs on code, and we might put "*pObj1" instead of "*pObj2" and introduce a memory leak that's nearly impossible to find. C and C++ also require us to write huge amounts of code to do very basic things like initializing a game window. Of course, there's a huge amount of code running behind the scenes of a .NET application, but it's code written by hundreds of professionals at Microsoft and rigorously tested over the many years of .NET's lifespan. I'd rather have that backing up the boilerplate level of my engine and applications than something I cobbled together over the course of a few days. The nay-sayers are going to be quite surprised to see what C# is capable of over the next few years; I'm not the only one with this philosophy, nor the only one working on a large C# project in the game dev field.

Anyway, I'm going to take your advice on profiling to heart, and I will soon be looking into Intel's MKL to see what benefits it might provide. Perhaps I can make it an optional back-end for my math APIs and let developers choose when to use it (or not).

#16 ATC   Members   -  Reputation: 551


Posted 26 September 2012 - 03:07 PM

I'm ignoring here the fact that you're not using C/C++ with direct SSE assembly/intrinsics but C#; still, the point is that yes, division is considerably slower *for the CPU* to execute than multiplication, even on modern CPUs. Whether that can be seen in the C# execution environment is then a matter of profiling.


I looked into this, and the JIT compiler does emit arithmetic code in a rather transparent and expected manner. And if you tick the "Optimize Code" option for the compiler, it can often figure out ways to squeeze a few more ticks of speed out of things.

#17 metsfan   Members   -  Reputation: 652


Posted 26 September 2012 - 08:05 PM

And sometimes it's considerably better than 'real' compiled code. There are a number of optimisations available to (tracing) JITs, that a ahead-of-time compiler just doesn't have enough data to perform.


This is a poor example, as is shown in the comments by several posters. This test proves nothing because we don't see the output assembly, so we have no idea whether these two programs are ACTUALLY doing the same thing. A poster about halfway down the page posts another, more realistic benchmark, and the results show that the C implementation is about 12x faster than the PyPy implementation.

C# is a good language. It is powerful, and it certainly helps you get off the ground faster. The biggest issues I see with C# though are that it's much more difficult to build cross-platform, and that for extreme performance cases you may not be able to nail the performance to the level you want (i.e. a physics engine for an FPS, or games with high levels of network activity such as an MMOG). I'm always willing to be proved wrong though, so if I am wrong, have at me =)

#18 swiftcoder   Senior Moderators   -  Reputation: 9585


Posted 26 September 2012 - 10:40 PM

This test proves nothing because we don't see the output assembly, so we have no idea whether these two programs are ACTUALLY doing the same thing. A poster about halfway down the page posts another, more realistic benchmark, and the results show that the C implementation is about 12x faster than the PyPy implementation.

That's honestly pretty irrelevant. We always knew that a good low-level programmer can achieve better performance through hand-coded assembly, but few people have that kind of skill set (or the time and energy to apply it to their entire project). The fact that the general case is faster for the JIT than for the AOT compiler says a lot about how far JITs have come.

The biggest issues I see with C# though ... for extreme performance cases you may not be able to nail the performance to the level you want (i.e. a physics engine for an FPS, or games with high levels of network activity such as an MMOG).

MMOGs generally have very low amounts of network traffic compared to an FPS, but leaving that aside, it's not really a problem. For starters, both of those tasks are best left to dedicated middleware (presumably already thoroughly optimised by a 3rd party). And if you do roll your own and find a performance problem that is insurmountable in C#, well, there is always the managed-C++ bridge...



#19 ATC   Members   -  Reputation: 551


Posted 26 September 2012 - 11:18 PM

C# is a good language. It is powerful, and it certainly helps you get off the ground faster. The biggest issues I see with C# though are that it's much more difficult to build cross-platform, and that for extreme performance cases you may not be able to nail the performance to the level you want (i.e. a physics engine for an FPS, or games with high levels of network activity such as an MMOG). I'm always willing to be proved wrong though, so if I am wrong, have at me =)


There are actually sets of very good empirical benchmarks and tests out there on the web (I think some were performed by Intel) that show C# code blowing C and C++ code out of the water in certain cases. It's been a few years since I read them, but that's part of what got me into C#. I'll try to find them again.

#20 metsfan   Members   -  Reputation: 652


Posted 27 September 2012 - 08:41 AM

That's honestly pretty irrelevant. We always knew that a good low-level programmer can achieve better performance through hand-coded assembly, but few people have that kind of skill set (or the time and energy to apply it to their entire project). The fact that the general case is faster for the JIT than for the AOT compiler says a lot about how far JITs have come.


I didn't mean that a programmer could achieve better results through hand-coded assembly. I think the point that I and those commenters were making is that unless you post the output assembly for the benchmarks, there's no way of knowing whether the computer is really doing the same thing in both cases. If I understood correctly, people were saying that because the benchmark was just buffering a string and never actually doing anything with it, the JIT compiler was likely optimizing out the entire benchmark. However, without the raw assembly output, we'll never know. That's all I was saying.



