King_DuckZ

Optimizations in debug



Recently I rewrote the matrix class we use here at work. The old one was really a collection of classes, like matrix4x4, matrix4x3, etc., one very long copy and paste with small changes here and there. My approach was to turn it into a template.

That being the case, the old multiplication code was a manually unrolled loop (a00 * b00 + a01 * b10, etc.), while the new one is the more classical and generic "3 nested loops". The optimized build of the new code is many times faster than the old code, which is cool. The problem is that in debug mode all the loops are kept and no inlining is performed, so the resulting code is 6-7 times slower than the old one (which was already slow due to the many cache misses).
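For readers who want to picture the generic version, here is a minimal sketch of a templated matrix with the "3 nested loops" product. The `Matrix` type, its `at` accessor and the row-major `m[Rows][Cols]` storage are assumptions made for illustration; the actual class from work is not shown in the thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the templated matrix class; names are illustrative.
template <typename T, std::size_t Rows, std::size_t Cols>
struct Matrix {
    T m[Rows][Cols] = {};  // zero-initialised, row-major storage

    T& at(std::size_t r, std::size_t c) {
        assert(r < Rows && c < Cols);
        return m[r][c];
    }
    const T& at(std::size_t r, std::size_t c) const {
        assert(r < Rows && c < Cols);
        return m[r][c];
    }
};

// Generic "3 nested loops" product: (R x K) * (K x C) -> (R x C).
// An optimizing build can unroll and inline this; a debug build
// typically keeps every loop and every call as written.
template <typename T, std::size_t R, std::size_t K, std::size_t C>
Matrix<T, R, C> operator*(const Matrix<T, R, K>& a, const Matrix<T, K, C>& b) {
    Matrix<T, R, C> result;  // starts at zero thanks to the member initialiser
    for (std::size_t i = 0; i < R; ++i)
        for (std::size_t j = 0; j < C; ++j)
            for (std::size_t k = 0; k < K; ++k)
                result.m[i][j] += a.m[i][k] * b.m[k][j];
    return result;
}
```

One template covers 4x4, 4x3 and any other size, which is what replaces the copy-and-pasted classes; the cost is that all of the speed now depends on the optimizer.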

I know that debug is normally slower than release, but in order to keep the framerate acceptable for the others I'd like my code to be a bit faster in debug. I thought I could surround my function with a #pragma optimize, but that would be VS-specific and maybe there are other downsides I'm not aware of. Any suggestions?

I'm seeing if I can use SIMD intrinsics instead, but I'm not sure I'll get to do that due to some issues with our allocator and the 16-byte alignment.

There are #pragma optimize equivalents for the other compilers. If you need to support multiple compilers in a generic way, you could make a pair of headers that contain the platform-specific pragmas, so you could write, e.g.:
#include "push_force_optimize.h"
void foo()
{
}
#include "pop_force_optimize.h"
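A rough sketch of what those two headers might contain, shown inline in one file so it compiles on its own. The header names come from the post above; their contents, and the function `sum_of_squares`, are assumptions. The pragmas used are `#pragma optimize` for MSVC and `#pragma GCC push_options` / `optimize` / `pop_options` for GCC. I don't know of a Clang pragma that forces optimization on, so Clang is left as a no-op here.

```cpp
#include <cassert>

// --- what a hypothetical push_force_optimize.h could contain ---
#if defined(_MSC_VER) && !defined(__clang__)
#  pragma optimize("gt", on)     // MSVC: global optimizations, favor fast code
#elif defined(__GNUC__) && !defined(__clang__)
#  pragma GCC push_options       // remember the command-line options
#  pragma GCC optimize("O2")     // optimize the functions that follow
#endif

// Example hot function compiled with the forced settings.
int sum_of_squares(int n) {
    assert(n >= 0);
    int total = 0;
    for (int i = 1; i <= n; ++i)
        total += i * i;
    return total;
}

// --- what a hypothetical pop_force_optimize.h could contain ---
#if defined(_MSC_VER) && !defined(__clang__)
#  pragma optimize("", on)       // MSVC: back to the command-line settings
#elif defined(__GNUC__) && !defined(__clang__)
#  pragma GCC pop_options        // restore the remembered options
#endif
```

Two caveats worth checking against your compiler's documentation: these pragmas affect code generation for the functions defined between them, not the debug-mode inlining policy, and as far as I know MSVC silently ignores `#pragma optimize` when the build uses the /RTC options.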

If you're using Visual Studio, one thing that can make a big difference in debug, especially when calling a lot of small functions, is to turn off the 'basic runtime checks' in the code generation settings.

Of course you lose some of the benefits of debug mode that way, but I can't remember the last time I wrote code that tripped either of those runtime checks.
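If turning the checks off for the whole project feels too broad, MSVC also has a `runtime_checks` pragma that can switch them off around a block of code. The sketch below is only an illustration: `Vec4` and `dot` are hypothetical names, and on compilers other than MSVC the pragma is compiled out.

```cpp
#include <cassert>

#if defined(_MSC_VER) && !defined(__clang__)
#  pragma runtime_checks("", off)      // disable all /RTC checks from here on
#endif

// Hypothetical small math type with the kind of tiny functions that get
// called very often in debug builds.
struct Vec4 {
    float v[4];

    float operator[](int i) const {
        assert(i >= 0 && i < 4);
        return v[i];
    }
};

inline float dot(const Vec4& a, const Vec4& b) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        sum += a[i] * b[i];
    return sum;
}

#if defined(_MSC_VER) && !defined(__clang__)
#  pragma runtime_checks("", restore)  // back to the command-line /RTC settings
#endif
```

This keeps the runtime checks active for the rest of the code base and only gives them up inside the math code.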

I tried #pragma optimize("gyt", on), as well as "gt", but in both cases I ended up with slower code. Disabling basic runtime checks improved things a lot: it's about 6 times faster. I also found this article, giving some details about the /RTC1 switch (here's something for those unfamiliar with the assembly "rep stosd"). Thanks for helping!

Debug SIMD with MSVC is pretty horrific: it ends up writing the result of each and every instruction to memory, I assume so the debugger can show you the values more easily. In optimized builds a huge number of those stay in registers and never hit real RAM, plus the scheduling is all different. Straight-up FPU code can beat debug SIMD code in a lot of cases, it's that messed up.


[quote]Recently I rewrote the matrix class we use here at work. The old one was really a collection of classes, like matrix4x4, matrix4x3, etc., one very long copy and paste with small changes here and there. My approach was to turn it into a template.[/quote]


Don't be so quick to get rid of those classes. If you really want to optimize your code, the best method is simply to NOT do as much work. A matrix4x3 vs. a matrix4x4 does exactly that (and since the 4x4 will be doing everything the 4x3 does, plus a little bit more, one would assume you can share code between the classes, with a few additions in the 4x4).


[quote]That being the case, the old multiplication code was a manually unrolled loop (a00 * b00 + a01 * b10, etc.), while the new one is the more classical and generic "3 nested loops". The optimized build of the new code is many times faster than the old code, which is cool. The problem is that in debug mode all the loops are kept and no inlining is performed, so the resulting code is 6-7 times slower than the old one (which was already slow due to the many cache misses).[/quote]


First things first: a00 * b00 + a01 * b10 is highly inefficient. For loops aren't any better, to be honest. Your aim is to remove variable dependencies and to use the most linear memory access pattern possible, i.e.

r00 = a00 * b00;
r01 = a00 * b01;
r02 = a00 * b02;
r03 = a00 * b03;

r00 += a01 * b10;
r01 += a01 * b11;
r02 += a01 * b12;
r03 += a01 * b13;

<snip>


Not using SIMD for this is simply another way of murdering polar bears. Save some CO2, use the CPU more efficiently.....
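To make the row-at-a-time pattern concrete, here is a hedged sketch of it for row-major 4x4 float matrices, once in scalar form and once with SSE intrinsics. The function names and the `MATRIX_HAVE_SSE` macro are made up for this example. The SSE path uses unaligned loads and stores (`_mm_loadu_ps` / `_mm_storeu_ps`) on purpose, since alignment is not yet guaranteed in the code base being discussed, and it falls back to the scalar version where SSE is not available.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define MATRIX_HAVE_SSE 1   // hypothetical feature macro, not from the thread
#endif

// Scalar form: each output row is a[i][0]*row0(b) + a[i][1]*row1(b) + ...,
// so b is read linearly and the four sums per row are independent.
// Matrices are row-major float[16]; r must not alias a or b.
void mul4x4_scalar(const float* a, const float* b, float* r) {
    assert(r != a && r != b);
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j)
            r[4 * i + j] = a[4 * i] * b[j];
        for (int k = 1; k < 4; ++k)
            for (int j = 0; j < 4; ++j)
                r[4 * i + j] += a[4 * i + k] * b[4 * k + j];
    }
}

// Same pattern with one SSE register per row of b.
void mul4x4_simd(const float* a, const float* b, float* r) {
#if defined(MATRIX_HAVE_SSE)
    assert(r != a && r != b);
    const __m128 b0 = _mm_loadu_ps(b + 0);
    const __m128 b1 = _mm_loadu_ps(b + 4);
    const __m128 b2 = _mm_loadu_ps(b + 8);
    const __m128 b3 = _mm_loadu_ps(b + 12);
    for (int i = 0; i < 4; ++i) {
        // Broadcast one element of a and scale a whole row of b with it.
        __m128 row = _mm_mul_ps(_mm_set1_ps(a[4 * i + 0]), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 1]), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 2]), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[4 * i + 3]), b3));
        _mm_storeu_ps(r + 4 * i, row);
    }
#else
    mul4x4_scalar(a, b, r);  // no SSE on this target: use the scalar path
#endif
}
```

Once every matrix is known to be 16-byte aligned, the unaligned loads and stores could be swapped for `_mm_load_ps` / `_mm_store_ps`.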

[quote]I'm seeing if I can use SIMD intrinsics instead, but I'm not sure I'll get to do that due to some issues with our allocator and the 16-byte alignment.[/quote]
Then do this first:

assert( isAligned(this) && isAligned(&arg0) );

At the start of every matrix / vector method. You'll quickly find all of the locations of mis-aligned structs, and dodgy allocators. Once your program runs without asserting, you can start switching over to SIMD.
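The post does not show `isAligned`, so the helper below is only one plausible way to write it: it treats the pointer as an integer and checks that it is a multiple of 16. The `Vec4` type and its `operator+=` are hypothetical and are there only to show where the assert would go.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed implementation of the isAligned helper mentioned above:
// true if p is a multiple of `alignment` bytes (16 by default, for SSE).
inline bool isAligned(const void* p, std::size_t alignment = 16) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical vector type; alignas(16) covers stack and static storage,
// but a custom allocator can still hand back misaligned heap memory.
struct alignas(16) Vec4 {
    float v[4];

    Vec4& operator+=(const Vec4& arg0) {
        // Fires in debug builds as soon as a misaligned object reaches here.
        assert(isAligned(this) && isAligned(&arg0));
        for (int i = 0; i < 4; ++i)
            v[i] += arg0.v[i];
        return *this;
    }
};
```

Because the check compiles away when NDEBUG is defined, it costs nothing in release builds.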
