Optimizations in debug

Started by King_DuckZ; 4 comments, last by RobTheBloke 12 years, 4 months ago
Recently I rewrote the matrix class we use here at work. The old one was really a collection of classes (matrix4x4, matrix4x3, etc.), a very long copy-and-paste job with small changes here and there. My approach was to go with templates.

That being the case, the old multiplication code was a manually unrolled loop (a00 * b00 + a01 * b10, etc.), while the new one is the more classical and generic "three nested loops". The generated optimized code is many times faster than the old code, which is cool. The problem is that in debug mode all the loops are kept, no inlining is performed, and the resulting code is 6-7 times slower than the old one (which was already slow due to the many cache misses).
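To give an idea, the new version boils down to something like this (a minimal sketch, not our actual class; the Matrix struct and the row-major storage are just for illustration):

template <typename Scalar, int Rows, int Cols>
struct Matrix
{
    Scalar m[Rows][Cols];
};

// Generic triple-loop multiply: the optimizer unrolls and inlines this nicely,
// but a debug build keeps all three loops and every index calculation.
template <typename Scalar, int Rows, int Shared, int Cols>
Matrix<Scalar, Rows, Cols> multiply(const Matrix<Scalar, Rows, Shared>& a,
                                    const Matrix<Scalar, Shared, Cols>& b)
{
    Matrix<Scalar, Rows, Cols> r = {};
    for (int i = 0; i < Rows; ++i)
        for (int j = 0; j < Cols; ++j)
            for (int k = 0; k < Shared; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}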

I know that debug is normally slower than release, but to keep the framerate acceptable for the others I'd like my code to be a bit faster in debug. I thought I could surround my function with a #pragma optimize, but that would be VS-specific and maybe there are other downsides I'm not aware of. Any suggestions?

I'm seeing if I can use SIMD intrinsics instead, but I'm not sure I'll get to do that due to some issues with our allocator and the 16-byte alignment.

[ King_DuckZ out-- ]
There are #pragma optimize equivalents for the other compilers. If you need to support multiple compilers in a generic way, you could make a header that contains the platform-specific pragmas, so you could write, e.g.:

#include "push_force_optimize.h"
void foo()
{
}
#include "pop_force_optimize.h"
If you're using Visual Studio, one thing that can make a big difference in debug, especially when calling lots of small functions, is to turn off the 'basic runtime checks' in the code generation settings.

Of course you lose some of the benefits of debug mode that way, but I can't remember the last time I wrote code that tripped either of those runtime checks.
I tried #pragma optimize("gyt", on) as well as "gt", but in both cases I ended up with slower code. Disabling the basic runtime checks improved things a lot though; it's something like 6 times faster now. I also found an article giving some details about the /RTC1 switch (there's something in there for those unfamiliar with the "rep stosd" assembly, too). Thanks for helping!
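Edit: in case anyone else needs it, the basic runtime checks can also be turned off for just the hot functions instead of the whole project, via MSVC's runtime_checks pragma (a sketch as I understand the docs, I haven't measured it myself):

// must appear at file scope, outside any function
#pragma runtime_checks("", off)      // disable all /RTC checks for the functions defined below

// ... matrix multiply and the other hot math goes here ...

#pragma runtime_checks("", restore)  // back to the project-wide /RTC settings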
[ King_DuckZ out-- ]
Debug SIMD with MSVC is pretty horrific: it ends up writing all the results to memory for each and every instruction, I assume so the debugger can show you the values more easily. A huge number of those stores get turned into registers in optimized builds and never hit real RAM, plus the scheduling is all different. Straight-up FPU code can beat debug SIMD code in a lot of cases, it's that messed up.
http://www.gearboxsoftware.com/

Recently I rewrote the matrix class we use here at work. The old one was really a collection of classes (matrix4x4, matrix4x3, etc.), a very long copy-and-paste job with small changes here and there. My approach was to go with templates.


Don't be so quick to get rid of those classes. If you really want to optimize your code, then the best method is simply NOT to do as much work. A matrix4x3 vs a matrix4x4 does exactly that, and since the 4x4 does everything the 4x3 does (plus a little bit more), one would assume you can share code between the classes, with a few additions in the 4x4.
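To make the "do less work" point concrete, something like this is all a 4x3 * 4x3 concatenation needs (a rough sketch, written with loops just to keep it short, assuming a row-major matrix4x3 that stores the top three rows of an affine 4x4; obviously not your actual class):

struct matrix4x3
{
    float m[3][4];   // top three rows of an affine 4x4; bottom row is implicitly (0, 0, 0, 1)
};

// Affine concatenation: 36 multiplies instead of the 64 a full 4x4 * 4x4 needs,
// because the implicit bottom row is never computed, read or written.
matrix4x3 concatenate(const matrix4x3& a, const matrix4x3& b)
{
    matrix4x3 r;
    for (int i = 0; i < 3; ++i)
    {
        for (int j = 0; j < 4; ++j)
        {
            r.m[i][j] = a.m[i][0] * b.m[0][j]
                      + a.m[i][1] * b.m[1][j]
                      + a.m[i][2] * b.m[2][j];
        }
        r.m[i][3] += a.m[i][3];   // contribution of b's implicit (0, 0, 0, 1) row
    }
    return r;
}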


That being the case, the old multiplication code was a manually unrolled loop (a00 * b00 + a01 * b10, etc.), while the new one is the more classical and generic "three nested loops". The generated optimized code is many times faster than the old code, which is cool. The problem is that in debug mode all the loops are kept, no inlining is performed, and the resulting code is 6-7 times slower than the old one (which was already slow due to the many cache misses).


First things first: a00 * b00 + a01 * b10 is highly inefficient, and for loops aren't any better tbh. Your aim is to remove variable dependencies and use the most linear memory access pattern possible, i.e.

r00 = a00 * b00;
r01 = a00 * b01;
r02 = a00 * b02;
r03 = a00 * b03;

r00 += a01 * b10;
r01 += a01 * b11;
r02 += a01 * b12;
r03 += a01 * b13;

<snip>


Not using SIMD for this is simply another way of murdering polar bears. Save some CO2, use the CPU more efficiently.....
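For instance, something along these lines (a rough sketch, assuming row-major 4x4 float matrices that are 16-byte aligned; it's the same broadcast-one-element-of-a-against-a-whole-row-of-b pattern as above, just four lanes at a time):

#include <xmmintrin.h>

// r = a * b for row-major 4x4 float matrices; a, b and r must be 16-byte aligned.
void multiply4x4_sse(const float* a, const float* b, float* r)
{
    for (int i = 0; i < 4; ++i)
    {
        // broadcast a[i][k] and accumulate it against row k of b
        __m128 row = _mm_mul_ps(_mm_set1_ps(a[i * 4 + 0]), _mm_load_ps(b + 0));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[i * 4 + 1]), _mm_load_ps(b + 4)));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[i * 4 + 2]), _mm_load_ps(b + 8)));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_set1_ps(a[i * 4 + 3]), _mm_load_ps(b + 12)));
        _mm_store_ps(r + i * 4, row);
    }
}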

I'm seeing if I can use SIMD intrinsics instead, but I'm not sure I'll get to do that due to some issues with our allocator and the 16-byte alignment.
Then do this first:

assert( isAligned(this) && isAligned(&arg0) );

At the start of every matrix / vector method. You'll quickly find all of the mis-aligned structs and dodgy allocators that way. Once your program runs without asserting, you can start switching over to SIMD.
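isAligned isn't anything standard by the way, just whatever helper suits you; a one-liner along these lines does the job (16 bytes being what the aligned SSE loads and stores require):

#include <cstddef>
#include <cstdint>

// true if p sits on an 'alignment'-byte boundary (alignment must be a power of two)
inline bool isAligned(const void* p, std::size_t alignment = 16)
{
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}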

This topic is closed to new replies.
