FX compiler using more temporary registers than necessary
#1 Members - Reputation: 149
Posted 13 September 2012 - 02:22 PM
To better visualize, this is the pseudocode:
Code before for loop uses about 25 registers
for(# of iterations)
{
use about 15 newly created local variables inside scope of for loop
}
more code that requires 5 new variables outside of for loop
I have tried reducing the number of global variables by recalculating them instead of storing them, but that doesn't do anything, even though it should.
#2 Members - Reputation: 405
Posted 13 September 2012 - 02:59 PM
With the /Od flag (optimizations disabled), I get the fewest temporary registers used. Without /Od, or with /O0 or /O1, I get 50% more registers used.
The extra registers are probably being used to speed things along. If the architecture of the shader units is anything like that of a modern CPU, instructions overlap each other during execution. Suppose you have an expression like a*b * c*d. You can do that with 1 temp register:
tmp = a * b
tmp = tmp * c
tmp = tmp * d
Or you can do it with 2 temp registers:
tmp = a*b
tmp2 = c*d
tmp = tmp * tmp2
Both are three multiplies. The difference is that in the second one, the multiplies can overlap: tmp = a*b and tmp2 = c*d have no dependencies on each other, so they can happen at the same time. The price you pay for this kind of speed optimization is extra registers being used.
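For reference, here are the two orderings written out as plain C, as a minimal sketch rather than anything the shader compiler actually emits. The function names product_serial and product_paired are made up for illustration. Note that with floating-point inputs the two groupings can round differently in general, which is one reason a compiler may not always be free to switch between them.

```c
/* Illustrative only: two ways of evaluating a*b*c*d.
   Function names are hypothetical, not from any shader API. */

/* One temporary: every multiply depends on the result before it. */
float product_serial(float a, float b, float c, float d)
{
    float tmp = a * b;
    tmp = tmp * c;      /* has to wait for a*b */
    tmp = tmp * d;      /* has to wait for tmp*c */
    return tmp;
}

/* Two temporaries: a*b and c*d do not depend on each other,
   so hardware that overlaps instructions can issue them together. */
float product_paired(float a, float b, float c, float d)
{
    float tmp  = a * b;
    float tmp2 = c * d; /* independent of tmp */
    return tmp * tmp2;  /* only this multiply waits on both */
}
```

Both functions perform three multiplies; the second simply keeps one more value live at the same time.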
#3 Members - Reputation: 149
Posted 13 September 2012 - 03:04 PM
#4 Members - Reputation: 149
Posted 13 September 2012 - 03:09 PM
I'm aware of this; I've even written code to reduce dependency on previous calculations. I've since removed most of that code so that I use fewer registers, at the expense of waiting a few cycles for a calculation to finish. With optimizations disabled, it shouldn't be using extra registers to speed up these calculations, right?
#5 Members - Reputation: 149
Posted 13 September 2012 - 04:46 PM
Another example: you write int c = a + b. You use c a few times right after it's created, but then don't need it again until much later in the program. Because a + b is a fast calculation, you're willing to recalculate it in order to free up a register. So later on in the program, you write int d = a + b. Because d equals c, the compiler will not recalculate d. It will instead keep c in a register, and will not use that register for anything else.
The reason for the increase in registers in my shader is that the compiler is not evaluating the repeated lines of code that I purposely put in to reduce register usage; instead, it is storing the results of those lines in registers. Is there a way to force the compiler not to allocate these unnecessary registers? The /Od flag is enabled, so there should be no optimizations, but the compiler is obviously still trying to optimize my code! In my case, code that uses 41 or more registers will result in at least 17% fewer threads being able to run compared to code that uses 40 registers or fewer.
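The pattern described above can be sketched in plain C. This is only an illustration of the two source forms, with hypothetical function names; it says nothing about what the FX compiler does internally. A compiler that performs common subexpression elimination is allowed to treat the second function exactly like the first, which matches the behaviour reported here.

```c
/* Illustrative only; names are hypothetical. */

/* Form 1: compute c once and keep it around until the late use. */
int keep_c_live(int a, int b)
{
    int c = a + b;
    int early = c * 2;   /* c is used right after it is created */
    /* ... imagine a long stretch of unrelated work here,
       during which c still occupies a register ... */
    int late = c + 1;    /* c is needed again much later */
    return early + late;
}

/* Form 2: written to recompute a + b so that c could be dropped early.
   The compiler may notice that d equals c and keep c live anyway. */
int recompute_as_d(int a, int b)
{
    int c = a + b;
    int early = c * 2;
    /* ... same long stretch of unrelated work ... */
    int d = a + b;       /* intended as a fresh calculation */
    int late = d + 1;
    return early + late;
}
```

The two functions always return the same value, which is exactly why the compiler considers itself free to merge c and d.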
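The threshold effect can be shown with a little arithmetic. The sketch below assumes a purely hypothetical budget of 240 register slots shared by the resident thread groups; that number is not from the post or from any real GPU and was picked only so that the boundary lands at 40 registers. The function names are also made up.

```c
/* Illustrative only: the budget of 240 used in the usage note
   is a hypothetical value, not a real hardware figure. */

/* How many groups fit when each one needs regs_per_group slots.
   Integer division rounds down, which is what creates the cliff. */
int resident_groups(int register_budget, int regs_per_group)
{
    return register_budget / regs_per_group;
}

/* Percentage drop when going from `before` groups to `after` groups. */
double percent_fewer(int before, int after)
{
    return 100.0 * (before - after) / before;
}
```

Under that assumed budget, resident_groups(240, 40) gives 6 and resident_groups(240, 41) gives 5, so one extra register costs one group in six, or roughly 17%. The real budget and allocation granularity depend on the hardware and are not given in the thread.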
Edited by NotTakenSN, 13 September 2012 - 04:50 PM.
#6 Members - Reputation: 149
Posted 13 September 2012 - 07:40 PM
#7 Moderators - Reputation: 5642
Posted 13 September 2012 - 08:20 PM
#8 Members - Reputation: 149
Posted 13 September 2012 - 09:10 PM
#9 Members - Reputation: 149
Posted 13 September 2012 - 09:20 PM
#10 Moderators - Reputation: 5642
Posted 14 September 2012 - 12:35 PM
Unfortunately, register allocation isn't really something you have much control over in DirectX, as I'm sure you're aware by now. Even if you did, it's hard to make decisions about ALU instructions vs. register pressure without knowing the actual hardware that your shader will run on. On the console platforms I've worked on you have more control over registers, since that makes sense for a fixed platform, but as far as PC goes, the advice I've seen from IHVs is to "just trust the driver to do the right thing". I couldn't really tell you how the D3D compiler or the driver makes its decisions regarding register allocation; I haven't seen that publicly documented or disclosed anywhere.
Edited by MJP, 14 September 2012 - 12:36 PM.






