NotTakenSN

FX compiler using more temporary registers than necessary

9 posts in this topic

I'm writing a compute shader, and I need to limit the number of temporary registers so I can have more blocks running on each streaming multiprocessor. My target video card is currently NVIDIA compute capability 2.0, which has 48KB of shared memory and 32K (32768) 32-bit temporary registers per streaming multiprocessor. I have split the shared memory into 8KB sections, so 6 blocks can exist on a multiprocessor. Each block has 128 threads, resulting in a maximum of 42 temporary registers per thread (but DirectX allocates registers in groups of 4, so I really have a maximum of 40).

My shader currently has over 500 instructions (without unrolling any loops). The part of the shader that uses the most registers is inside a for loop that is not unrolled (I have the allow_uav_condition flag enabled for this loop). This code mainly uses variables that are local to the loop, so once out of the loop, those variables should be dead and their registers should be reused. I then have more code below the for loop (but outside of it) that uses new local variables.

When I compile the code with just the for loop, I get exactly 40 temporary registers used (or 10 registers of four 32-bit values). But once I compile with the code below the for loop, I get more than 40 registers. The extra code should only require about 5 registers, and on leaving the for loop I should have gained at least 15 registers to reuse; therefore, the code shouldn't have exceeded 40 registers. My shader is way too long to go through all the assembly code and write down which register belongs to which variable. Is there a bug/inefficiency in the compiler, or is my logic incorrect?

I have tried compiling the code with all the possible combinations of the optimization flags. With the /Od flag (optimizations disabled), I get the least number of temporary registers used. Without /Od or with /O0 or /O1, I get 50% more registers used.
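The register budget worked out in this post can be reproduced with a few lines of arithmetic. This is only a sketch: the function and constant names are made up for illustration, the hardware figures are the ones quoted in the post, and the groups-of-4 rounding is the poster's stated assumption about how DirectX allocates registers.

```python
# Illustrative names only; the figures come from the post above.
REGISTERS_PER_SM = 32 * 1024      # 32-bit registers per multiprocessor
SHARED_MEM_PER_SM = 48 * 1024     # bytes of shared memory per multiprocessor


def register_budget(shared_mem_per_block, threads_per_block, granularity=4):
    """Return (blocks per SM, registers per thread) when shared memory sets the block count."""
    blocks = SHARED_MEM_PER_SM // shared_mem_per_block
    raw = REGISTERS_PER_SM // (blocks * threads_per_block)
    # Round down to the assumed allocation granularity (four-component registers).
    return blocks, raw - raw % granularity


blocks, regs = register_budget(shared_mem_per_block=8 * 1024, threads_per_block=128)
print(blocks, regs)  # 6 40
```

The raw quotient is 42 registers per thread; rounding down to a multiple of 4 gives the 40 the post is aiming for.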

To better visualize, this is the pseudocode:
Code before for loop uses about 25 registers
for(# of iterations)
{
use about 15 newly created local variables inside scope of for loop
}
more code that requires 5 new variables outside of for loop
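The expectation behind this pseudocode is that register use is set by how many values are live at the same time, not by how many variables exist in total. A minimal sketch of that reasoning follows; the live ranges and program positions are hypothetical numbers chosen to match the pseudocode, not anything read out of the real shader.

```python
# Hypothetical live ranges: (first use, last use, register count).
# Positions are arbitrary program points, not real instruction offsets.
LIVE_RANGES = [
    (0, 100, 25),   # values created before the loop, assumed live to the end
    (10, 60, 15),   # loop-local values, dead once the loop exits
    (61, 100, 5),   # values created after the loop
]


def peak_registers(ranges):
    """Return the largest number of registers live at any single program point."""
    # The live count only changes at range boundaries, so checking those is enough.
    points = sorted({p for start, end, _ in ranges for p in (start, end)})
    return max(
        sum(count for start, end, count in ranges if start <= p <= end)
        for p in points
    )


print(peak_registers(LIVE_RANGES))  # 40
```

If the compiler keeps the loop-local values alive past the loop (for example, changing the second range to `(10, 100, 15)`), the same function returns 45, which is the kind of overshoot described in the post.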

I have tried reducing the number of global variables by recalculating them instead of storing them, but that doesn't do anything, even though it should.

[quote]With the /Od flag (optimizations disabled), I get the least number of temporary registers used. Without /Od or with /O0 or /O1, I get 50% more registers used.[/quote]

The extra registers are probably being used to speed things along. If the architecture of the shader units is anything like that of a modern CPU, instructions overlap each other during execution. Suppose you have an expression like a*b * c*d. You can do that with 1 temp register:

tmp = a * b
tmp = tmp * c
tmp = tmp * d

Or you can do it with 2 temp registers:

tmp1 = a * b
tmp2 = c * d
tmp1 = tmp1 * tmp2

Both are three multiplies. The difference is that in the second one, the multiplies can overlap. [b]tmp1 = a*b[/b] and [b]tmp2 = c*d[/b] have no dependencies on each other, so they can happen at the same time. The price you pay for this kind of speed optimization is extra registers being used.
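The trade-off between the two orderings can be made concrete by measuring the longest dependency chain and the number of temporaries in each. This sketch assumes every instruction takes one step and that any number of independent instructions can issue together, which is a simplification and not a model of real shader hardware; the `analyze` helper is hypothetical.

```python
# Each instruction is (destination, operand, operand). Inputs a..d are ready at step 0.
SERIAL = [("tmp", "a", "b"), ("tmp", "tmp", "c"), ("tmp", "tmp", "d")]
PARALLEL = [("tmp1", "a", "b"), ("tmp2", "c", "d"), ("tmp1", "tmp1", "tmp2")]


def analyze(program):
    """Return (dependency-chain depth, number of distinct temporaries)."""
    ready = {}   # name -> step at which its current value becomes available
    depth = 0
    for dest, lhs, rhs in program:
        # An instruction can finish one step after its latest operand is ready.
        ready[dest] = 1 + max(ready.get(lhs, 0), ready.get(rhs, 0))
        depth = max(depth, ready[dest])
    temps = {dest for dest, _, _ in program}
    return depth, len(temps)


print(analyze(SERIAL))    # (3, 1)
print(analyze(PARALLEL))  # (2, 2)
```

The second ordering shortens the chain from three steps to two at the cost of one more temporary.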

I've looked through the assembly code in more detail, and I've found some strange behavior. I have a global counter that the compiler initially stores in r1.y (the y component of the second four-component 32-bit register). It then copies (moves) that value to r1.z. I don't know why, but it never uses r1.y again for the rest of the program; it uses r1.z instead. Why does it waste a register? Really annoying... I think it's doing this sort of thing throughout my entire shader, bloating the register usage.

[quote name='DracoLacertae' timestamp='1347569964' post='4979833']
[quote]With the /Od flag (optimizations disabled), I get the least number of temporary registers used. Without /Od or with /O0 or /O1, I get 50% more registers used.[/quote]

The extra registers are probably being used to speed things along. If the architecture of the shader units is anything like that of a modern CPU, instructions overlap each other during execution. Suppose you have an expression like a*b * c*d. You can do that with 1 temp register:

tmp = a * b
tmp = tmp * c
tmp = tmp * d

Or you can do it with 2 temp registers:

tmp1 = a * b
tmp2 = c * d
tmp1 = tmp1 * tmp2

Both are three multiplies. The difference is that in the second one, the multiplies can overlap. [b]tmp1 = a*b[/b] and [b]tmp2 = c*d[/b] have no dependencies on each other, so they can happen at the same time. The price you pay for this kind of speed optimization is extra registers being used.
[/quote]

I'm aware of this; I've even written code to reduce dependency on previous calculations. I've removed most of that code so I use fewer registers, at the expense of waiting a few cycles for a calculation to finish. With the optimizations disabled, it shouldn't be using extra registers to speed up these calculations, right?

I've gone through most of the assembly code and have found the following. If I initialize a variable to any thread ID (DispatchThreadID, GroupThreadID, or GroupID), a register will be allocated for that thread ID. If I then want to do arithmetic on that variable, it will copy the thread ID to another register and use that register for the variable. This is a waste of a register if I only need the thread ID to initialize a variable once. Another thing I've found is that the compiler will allocate a register to store the results of calculations that are done repeatedly, even if you don't explicitly declare a variable for that calculation, or even if you explicitly tell it to recalculate the value. For example, if you write the statement if(a>b) multiple times, it will allocate a register to store the result of a>b, even if you don't write bool c = (a > b).

Another example: you write int c = a + b. You use c a few times right after it's created, but then don't need it again until much later in the program. Because a + b is a fast calculation, you're willing to recalculate c in order to free up a register. So later on in the program, you write int d = a + b. Because d = c, the compiler will not calculate d. It will instead keep c in a register and will not use that register for anything else.
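The behavior described here looks like common-subexpression elimination. Whether fxc implements it this way is not confirmed anywhere in the thread; the following is only a sketch of one textbook approach (keying each expression on its operator and operands), with a hypothetical helper name and the assumption that no variable is ever reassigned.

```python
def eliminate_common_subexpressions(statements):
    """Rewrite (dest, op, lhs, rhs) statements so repeated expressions reuse the first result."""
    seen = {}    # (op, lhs, rhs) -> name that already holds this value
    alias = {}   # name -> earlier name it was folded into
    output = []
    for dest, op, lhs, rhs in statements:
        key = (op, alias.get(lhs, lhs), alias.get(rhs, rhs))
        if key in seen:
            alias[dest] = seen[key]   # no instruction emitted; dest is just another name
        else:
            seen[key] = dest
            output.append((dest, *key))
    return output, alias


program = [
    ("c", "+", "a", "b"),
    ("x", "*", "c", "c"),   # early use of c
    ("d", "+", "a", "b"),   # intended as a cheap recomputation
    ("y", "-", "d", "a"),   # late use of d
]
code, alias = eliminate_common_subexpressions(program)
print(alias)     # {'d': 'c'}
print(code[-1])  # ('y', '-', 'c', 'a')
```

After the rewrite, the late use reads `c` directly, so `c` has to stay in a register from its first definition to that last use. That is the opposite of what the manual recomputation was meant to achieve.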

The reason for the increase in registers in my shader is that the compiler is not evaluating the repeated lines of code that I purposely put in to reduce register usage; instead, it is storing the results of those lines in registers. Is there a way to force the compiler to not create unnecessary registers? The /Od flag is enabled, so there should be no optimizations, but the compiler is still obviously trying to optimize my code! In my case, code that uses 41 or more registers will result in at least 17% fewer threads being able to run compared to code that uses 40 registers or fewer. Edited by NotTakenSN
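The 17% figure follows from the block counts. The sketch below uses the numbers given earlier in the thread and assumes, as the original post does, that registers are allocated in groups of 4; the real hardware allocation granularity may differ, and the names are illustrative.

```python
REGISTERS_PER_SM = 32 * 1024
THREADS_PER_BLOCK = 128
MAX_BLOCKS_BY_SHARED_MEM = 6   # 48 KB split into 8 KB sections


def resident_blocks(registers_per_thread, granularity=4):
    """Blocks per multiprocessor once register use is rounded up to the allocation size."""
    allocated = -(-registers_per_thread // granularity) * granularity  # round up to a multiple
    by_registers = REGISTERS_PER_SM // (allocated * THREADS_PER_BLOCK)
    return min(by_registers, MAX_BLOCKS_BY_SHARED_MEM)


for regs in (40, 41):
    print(regs, resident_blocks(regs))
# 40 6
# 41 5
```

Going from 6 resident blocks to 5 removes one sixth of the threads, or about 16.7%.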

This is extremely frustrating. What kind of compiler has no setting to compile exactly what the programmer writes? Controlling the number of registers is vital for getting optimal performance, yet there is no way to force the compiler to stick to a certain number of registers. Does anyone know if you get this sort of control in OpenGL and OpenCL? I'm beginning to regret ever learning DirectX. Absolutely worthless documentation and support.

Are you looking at the D3D shader assembly? Because that assembly is just an intermediate format. At runtime the driver will JIT compile it to whatever microcode format is supported by that hardware, which means your assembly is not an accurate reflection of register usage or even which instructions get executed. You need to use vendor-specific tools for that kind of information.

Thanks for the reply. I am looking at the D3D shader assembly that is generated by the fxc.exe compiler. I'm aware that the assembly isn't 100% what the driver would generate at runtime, but how much would it actually differ? If the assembly code isn't similar to the actual code, what would be the purpose of even looking at assembly code then? How would you go about manually controlling the register usage then, or is it just something people don't bother with?

When you're doing a final build for the shaders and you enable the highest optimizations, how does the compiler know how to balance register usage with occupancy? Individual threads may execute faster when more registers are used, but you will have lower occupancy, since each block will require more resources. Does it balance out? Which one do you prefer?

How much it differs depends on the architecture. In my experience the microcode will still retain the overall structure of the shader assembly as far as things like stripping dead code, unrolling loops, branches, etc., but there will usually be differences in exactly how it uses registers and how many instructions it needs, since the assembly ISA won't match the microcode ISA. In some cases there can be quite a difference between the instructions, especially when you consider that Nvidia and newer AMD hardware only operate in terms of scalar instructions.

Unfortunately register allocation isn't really something you have much control of in DirectX, as I'm sure you're aware by now. Even if you did, it's hard to make decisions regarding ALU instructions vs. register pressure without knowledge of the actual hardware that your shader will run on. On console platforms I've worked on you have more control over registers, since that makes sense for a fixed platform, but as far as PC goes the advice I've seen from IHVs is to "just trust the driver to do the right thing". I couldn't really tell you how the D3D compiler or the driver makes its decisions regarding register allocation; I haven't seen that publicly documented or disclosed anywhere. Edited by MJP