FX compiler using more temporary registers than necessary

Started by
8 comments, last by MJP 11 years, 7 months ago
I'm writing a compute shader, and I need to limit the number of temporary registers it uses so that more blocks can run on each streaming multiprocessor. My target video card is currently NVIDIA compute capability 2.0, which has 48KB of shared memory and 32K (32768) 32-bit temporary registers per streaming multiprocessor. I have split the shared memory into 8KB sections, so 6 blocks can reside on a multiprocessor. Each block has 128 threads, resulting in a maximum of 42 temporary registers per thread (but since DirectX allocates registers in groups of 4, I really have a maximum of 40).

My shader currently has over 500 instructions (without unrolling any loops). The part of the shader that uses the most registers is inside a for loop that is not unrolled (I have the allow_uav_condition attribute on this loop). This code mainly uses variables declared local to the loop, so once the loop exits, those variables should be dead and their registers should be reusable. I then have more code below the for loop (but outside of it) that uses new local variables.

When I compile just the for loop, I get exactly 40 temporary registers used (i.e. 10 four-component 32-bit registers). But once I compile with the code below the for loop included, I get more than 40 registers. The extra code should only require about 5 registers, and on leaving the for loop I should have gained at least 15 registers to reuse; therefore, the code shouldn't have exceeded 40 registers. My shader is far too long to go through all the assembly code and record which register belongs to which variable. Is there a bug/inefficiency in the compiler, or is my logic incorrect?

I have tried compiling the code with all the possible combinations of the optimization flags. With the /Od flag (optimizations disabled), I get the fewest temporary registers used. Without /Od, or with /O0 or /O1, I get 50% more registers used.
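For reference, the compile-and-inspect cycle I'm describing looks like this (the entry point and file names are placeholders; /Fc writes out the generated assembly listing, where the dcl_temps line reports the temporary register count):

```
fxc.exe /T cs_5_0 /E CSMain /Od /Fc shader_od.asm MyShader.hlsl
fxc.exe /T cs_5_0 /E CSMain /O1 /Fc shader_o1.asm MyShader.hlsl
```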

To better visualize, this is the pseudocode:

code before the for loop — uses about 25 registers
for (# of iterations)
{
    uses about 15 local variables declared inside the loop's scope
}
code after the for loop — requires about 5 new variables
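A minimal HLSL sketch of that layout (the buffer, constant names, and arithmetic are invented for illustration, not taken from my actual shader):

```hlsl
RWStructuredBuffer<float> gOutput : register(u0);
cbuffer Params : register(b0) { uint gIterations; };

[numthreads(128, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    float acc = 0.0f;                       // lives across the loop

    [allow_uav_condition]
    for (uint i = 0; i < gIterations; ++i)
    {
        // locals scoped to the loop body; their registers
        // should be free for reuse once the loop exits
        float a = (float)(i + dtid.x);
        float b = a * 0.5f;
        acc += a * b;
    }

    // post-loop code with new locals that should reuse
    // the now-dead loop registers
    float post = acc + (float)dtid.x;
    gOutput[dtid.x] = post;
}
```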

I have tried reducing the number of global variables by recalculating them instead of storing them, but that doesn't lower the register count, even though it should.
[quote]With the /Od flag (optimizations disabled), I get the least number of temporary registers used. Without /Od or with /O0 or /O1, I get 50% more registers used.[/quote]

The extra registers are probably being used to speed things along. If the architecture of the shader units is anything like the architecture of any modern CPU, the instructions overlap each other during execution. Suppose you have an expression like a*b * c*d. You can do that with one temp register:

tmp = a * b
tmp = tmp * c
tmp = tmp * d

Or you can do it with two temp registers:

tmp = a*b
tmp2 = c*d
tmp = tmp * tmp2

Both are three multiplies. The difference is that in the second one, the multiplies can overlap: tmp = a*b and tmp2 = c*d have no dependencies on each other, so they can happen at the same time. The price you pay for these kinds of speed optimizations is extra registers being used.
I've looked through the assembly code in more detail, and I've found some strange behavior. I have a global counter that the compiler initially stores in r1.y (the y component of r1, the second four-component 32-bit register). It then copies (moves) that value to r1.z. I don't know why, but it never uses r1.y again for the rest of the program; it only uses r1.z. Why does it waste a register? Really annoying... I think it's doing this sort of thing throughout my entire shader, bloating the register usage.
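Illustratively, the pattern looks like this (invented assembly in the D3D shader model 5 style, not a listing from my shader):

```
mov  r1.z, r1.y            // counter copied from r1.y to r1.z
iadd r1.z, r1.z, l(1)      // only r1.z is used from here on;
                           // r1.y is never read again, yet stays allocated
```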

[quote]
With the /Od flag (optimizations disabled), I get the least number of temporary registers used. Without /Od or with /O0 or /O1, I get 50% more registers used.

The extra registers are probably being used to speed things along. If the architecture of the shader units is anything like the architecture of any modern CPU, the instructions overlap each other during execution. Suppose you have an expression like a*b * c*d. You can do that with one temp register:

tmp = a * b
tmp = tmp * c
tmp = tmp * d

Or you can do it with two temp registers:

tmp = a*b
tmp2 = c*d
tmp = tmp * tmp2

Both are three multiplies. The difference is that in the second one, the multiplies can overlap: tmp = a*b and tmp2 = c*d have no dependencies on each other, so they can happen at the same time. The price you pay for these kinds of speed optimizations is extra registers being used.
[/quote]

I'm aware of this; I've even written code to reduce dependencies on previous calculations. I've since removed most of that code so I use fewer registers, at the expense of waiting a few cycles for a calculation to finish. With optimizations disabled, it shouldn't be using extra registers to speed up these calculations, right?
I've gone through most of the assembly code and have found the following. If I initialize a variable from any thread ID (SV_DispatchThreadID, SV_GroupThreadID, or SV_GroupID), a register is allocated for that thread ID. If I then do arithmetic on that variable, the compiler copies the thread ID to another register and uses that register for the variable. This wastes a register if I only need the thread ID to initialize a variable once. Another thing I've found is that the compiler will allocate a register to store the result of a calculation that is done repeatedly, even if you don't explicitly declare a variable for that calculation, and even if you explicitly write out the recalculation. For example, if you write the statement if(a>b) multiple times, it will allocate a register to store the result of a>b, even though you never wrote bool c = (a > b).
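A small sketch of the two patterns I mean (illustrative code with invented names, not my actual shader):

```hlsl
RWStructuredBuffer<uint> gOut : register(u0);

[numthreads(128, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    // Pattern 1: the thread ID gets its own register, and a copy is
    // made into a second register the moment idx is modified.
    uint idx = dtid.x;
    idx += 1;

    // Pattern 2: the comparison result gets cached in a register,
    // even though no bool variable was ever declared for it.
    uint a = idx * 3;
    uint b = idx + 7;
    if (a > b) gOut[idx] = 1;
    if (a > b) gOut[idx + 1] = 2;
}
```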

Another example: you write int c = a + b. You use c a few times right after it's created, but then don't need it again until much later in the program. Because a + b is a fast calculation, you're willing to recalculate it in order to free up a register. So later on in the program, you write int d = a + b. Because d equals c, the compiler will not calculate d. It will instead keep c in a register for the entire interval, and will not use that register for anything else.
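In code, the pattern looks like this (illustrative; buffer and variable names are invented):

```hlsl
RWStructuredBuffer<int> gOut : register(u0);

[numthreads(128, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    int a = (int)id.x;
    int b = (int)id.y + 3;

    int c = a + b;          // used a few times early on
    gOut[id.x] = c * 2;
    gOut[id.x + 1] = c - 1;

    // ...long stretch of code where c is not needed...

    int d = a + b;          // written as a cheap recompute so c's
                            // register can be freed, but the compiler
                            // keeps c alive and substitutes it for d
    gOut[id.x + 2] = d;
}
```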

The reason for the increase in registers in my shader is that the compiler is not re-evaluating the repetitive lines of code that I purposely put in there to reduce register usage; instead, it is storing the results of those lines in a register. Is there a way to force the compiler not to create unnecessary registers? The /Od flag is enabled, so there should be no optimizations, but the compiler is still obviously trying to optimize my code! In my case, code that uses 41 or more registers will result in at least 17% fewer threads being able to run compared to code that uses 40 registers or fewer.
This is extremely frustrating. What kind of compiler has no setting to compile exactly what the programmer writes? Controlling the number of registers is vital for getting optimal performance, yet there is no way to force the compiler to stick to a certain number of registers. Does anyone know if you get this sort of control in OpenGL or OpenCL? I'm beginning to regret ever learning DirectX. Absolutely worthless documentation and support.
Are you looking at the D3D shader assembly? Because that assembly is just an intermediate format. At runtime the driver will JIT compile it to whatever microcode format is supported by that hardware, which means your assembly is not an accurate reflection of register usage or even which instructions get executed. You need to use vendor-specific tools for that kind of information.
Thanks for the reply. I am looking at the D3D shader assembly that is generated by the fxc.exe compiler. I'm aware that the assembly isn't 100% what the driver would generate at runtime, but how much would it actually differ? If the assembly code isn't similar to the actual code, what would be the purpose of even looking at assembly code then? How would you go about manually controlling the register usage then, or is it just something people don't bother with?
When you're doing a final build for the shaders and you enable the highest optimizations, how does the compiler know how to balance register usage with occupancy? Individual threads may execute faster when more registers are used, but you will have lower occupancy, since each block will require more resources. Does it balance out? Which one do you prefer?
How much it differs depends on the architecture. In my experience the microcode will still retain the overall structure of the shader assembly as far as things like stripping dead code, unrolling loops, branches, etc., but there will usually be differences in exactly how it uses registers and how many instructions it needs, since the assembly ISA won't match the microcode ISA. In some cases there can be quite a difference between the instructions, especially when you consider that Nvidia and newer AMD hardware only operate in terms of scalar instructions.

Unfortunately register allocation isn't really something you have much control over in DirectX, as I'm sure you're aware of by now. Even if you did, it's hard to make decisions regarding ALU instructions vs. register pressure without knowledge of the actual hardware your shader will run on. On the console platforms I've worked on you have more control over registers, since that makes sense for a fixed platform, but as far as PC goes the advice I've seen from IHVs is to "just trust the driver to do the right thing". I couldn't really tell you how the D3D compiler or the driver makes its decisions regarding register allocation; I haven't seen that publicly documented or disclosed anywhere.

This topic is closed to new replies.
