Prevent redundant constants register writes

3 comments, last by L. Spiro 11 years, 7 months ago
Is there a way to prevent API calls like SetPixelShaderConstantF() and SetVertexShaderConstantF() from happening if the value written is already there? Also is this a useful optimization and how expensive are many redundant constants register writes between DIP calls?

I was thinking if the device can't do this automatically, it could be done in software if you move away from the Effects framework and use the ": register (cx)" keyword in your shader generation and then scoreboard the register values yourself with an array of floats in C++.
Before optimizing anything, use a profiler to check whether this is really a problem. Not infrequently, an unfounded optimization just introduces bugs, bloats the code, and can even reduce performance.

Is there a way to prevent API calls like SetPixelShaderConstantF() and SetVertexShaderConstantF() from happening if the value written is already there? Also is this a useful optimization and how expensive are many redundant constants register writes between DIP calls?
Yes, it's definitely possible, but some methods for eliminating those redundant calls may be more expensive than the cost of the redundancy! So whether it's useful depends on how the problem is solved.
Regarding the cost of these redundant calls -- as usual with GPU work, it depends on a lot of things... First there's just the simple CPU overhead of making a function call into D3D -- this is a small cost, perhaps similar to any virtual function call in your own code, not much. The rest depends on your GPU/driver; there are a lot of ways that shader constants can be implemented.

1) Older SM2/3 cards, e.g. GeForce 7, may not even support shader constants as a hardware feature! These cards implement shader constants by cloning the asm code of your shader and patching the constant values into the asm code as literals. On these cards, it's likely that setting a constant simply writes it into a global buffer, and sets a 'dirty' flag. For each draw-call, if the 'dirty' flag is set, then new GPU-accessible RAM is allocated to store the new shader code (probably in a ring buffer - and if you fill it with too many constant-changes, the CPU will stall until the GPU drains it!), and a new version of the shader is written to that RAM using the appropriate cached constants. Depending on whether this task is carried out by the driver on the CPU, or the driver sends commands to the GPU's memory controller to do the patching, each draw-call that uses different constants from the last will have a very high CPU or GPU overhead.
N.B. this may only apply to vertex shaders, or pixel shaders, or both. To optimize for these GPUs, you would definitely want to sort by shader program, and by shader constants as a high priority.
Sorting draw-calls by shader-constant values sounds like a ridiculously hard task, so instead, I'd change your rendering API so that you use the abstraction of cbuffers instead -- this greatly simplifies the process of determining if two draw-calls use the same constants.
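To illustrate the cbuffer idea, here is a minimal sketch of a software constant block. The names ConstantBlock and sameConstants are hypothetical and not part of D3D9; the assumption is that each block is filled once, shared by every draw-call that uses the same values, and compared by identity or cached hash before falling back to a byte compare.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical software "cbuffer": a run of float4 registers that is built
// once and then referenced by every draw-call that uses the same values.
struct ConstantBlock {
    int startRegister;           // first c# register the block maps to
    std::vector<float> data;     // 4 floats per register
    std::uint64_t hash;          // cached hash of the raw bytes

    ConstantBlock(int start, std::vector<float> values)
        : startRegister(start), data(std::move(values)), hash(hashBytes()) {
        assert(data.size() % 4 == 0 && "constants are stored as float4 registers");
    }

    // FNV-1a over the raw bytes; any reasonable hash would do here.
    std::uint64_t hashBytes() const {
        std::uint64_t h = 0xcbf29ce484222325ull;
        const unsigned char* p =
            reinterpret_cast<const unsigned char*>(data.data());
        for (std::size_t i = 0; i < data.size() * sizeof(float); ++i) {
            h ^= p[i];
            h *= 0x100000001b3ull;
        }
        return h;
    }
};

// Two draw-calls use the same constants if they share a block, or if the
// blocks have the same layout, hash, and bytes.
inline bool sameConstants(const ConstantBlock& a, const ConstantBlock& b) {
    if (&a == &b) return true;                       // cheapest test first
    if (a.startRegister != b.startRegister) return false;
    if (a.hash != b.hash || a.data.size() != b.data.size()) return false;
    return a.data.empty() ||
           std::memcmp(a.data.data(), b.data.data(),
                       a.data.size() * sizeof(float)) == 0;
}
```

With blocks like this, a draw-call only needs to carry a pointer to its block, so "do these two draws use the same constants?" usually reduces to a pointer comparison.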

2) Cards newer than the above ones will support constants in hardware, so the ridiculously huge cost of generating unique shader asm code for different draw-calls disappears. The simplest way for them to send the constants to the GPU registers is to put them in the command buffer, along with the draw calls, so each time you set a constant, the driver is allocating some space in the command buffer (probably a ring buffer again; if you fill it with too many GPU commands per frame, the CPU may stall) and writing a packet such as { ID_SET_PS_REGISTER, (short)idxFirstReg, (short)regCount, /*floats * regCount * 4*/ }. Writing these packets on the CPU will be fairly cheap; it's basically just memcpy-ing your inputs to a different destination. Receiving the packets on the GPU should be almost free, as the GPU front-end that reads the commands will be running in parallel with the GPU units that are actually doing your rasterization and shading, and these latter units should always be the bottleneck.
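As a rough illustration of why the CPU-side cost is about that of a memcpy, here is a sketch of writing such a packet. Everything in it is an assumption made for illustration: real drivers use private packet formats, the PacketHeader layout and the writeSetPsRegisters helper are hypothetical, and a growing std::vector stands in for the driver's ring buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical packet id; real command formats are driver-specific.
constexpr std::uint16_t ID_SET_PS_REGISTER = 1;

// Hypothetical header mirroring the packet described above.
struct PacketHeader {
    std::uint16_t id;
    std::uint16_t idxFirstReg;
    std::uint16_t regCount;
};

// Appends one "set pixel shader registers" packet (header + payload) to a
// byte buffer and returns the number of bytes written. A real ring buffer
// would wrap around and possibly stall; this stand-in simply grows.
inline std::size_t writeSetPsRegisters(std::vector<unsigned char>& cmd,
                                       std::uint16_t firstReg,
                                       std::uint16_t regCount,
                                       const float* values) {
    const PacketHeader header{ID_SET_PS_REGISTER, firstReg, regCount};
    const std::size_t payloadBytes =
        static_cast<std::size_t>(regCount) * 4 * sizeof(float);
    const std::size_t oldSize = cmd.size();

    cmd.resize(oldSize + sizeof(header) + payloadBytes);
    std::memcpy(cmd.data() + oldSize, &header, sizeof(header));
    if (payloadBytes > 0) {
        // The whole cost of "setting constants" here is this copy.
        std::memcpy(cmd.data() + oldSize + sizeof(header), values, payloadBytes);
    }
    return sizeof(header) + payloadBytes;
}
```

Note that nothing in this path compares against previous values, which is why a redundant write costs about the same as a useful one.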

3) Even newer cards have the option of sending constants through the command buffer, or writing them into resident GPU RAM buffers where appropriate. If we put a lot of faith in the driver, it should be able to figure out the best way of doing this to make it very cheap, so again, the CPU-side cost should be about that of a memcpy, and the GPU cost should be almost nil. However, you should still group as many constants together as possible so that you can set a lot of constants with one call (i.e. using the 3rd parameter of SetPixelShaderConstantF).
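One possible way to get that grouping is to stage writes on the CPU and flush the touched range in a single call before each draw. This is only a sketch: ConstantBatcher is a hypothetical helper, and the upload callback is assumed to wrap something with the same parameter shape as SetPixelShaderConstantF (start register, float data, vector4 count).

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: stages constant writes and uploads the touched
// register range with one call.
class ConstantBatcher {
public:
    // Assumed to forward to e.g. device->SetPixelShaderConstantF(...).
    using UploadFn =
        std::function<void(unsigned start, const float* data, unsigned vec4Count)>;

    ConstantBatcher(unsigned registerCount, UploadFn upload)
        : staged_(registerCount * 4, 0.0f), upload_(std::move(upload)) {}

    // Copies values into the staging array and widens the dirty range.
    void set(unsigned reg, const float* values, unsigned vec4Count) {
        assert(vec4Count > 0 && (reg + vec4Count) * 4 <= staged_.size());
        std::memcpy(&staged_[reg * 4], values, vec4Count * 4 * sizeof(float));
        dirtyBegin_ = std::min(dirtyBegin_, reg);
        dirtyEnd_ = std::max(dirtyEnd_, reg + vec4Count);
    }

    // Call once before each draw: one upload covers every staged write.
    void flush() {
        if (dirtyBegin_ >= dirtyEnd_) return;  // nothing was set
        upload_(dirtyBegin_, &staged_[dirtyBegin_ * 4], dirtyEnd_ - dirtyBegin_);
        dirtyBegin_ = ~0u;
        dirtyEnd_ = 0;
    }

private:
    std::vector<float> staged_;
    UploadFn upload_;
    unsigned dirtyBegin_ = ~0u;  // empty range: begin > end
    unsigned dirtyEnd_ = 0;
};
```

The trade-off is that a single min-to-max range also re-sends any untouched registers that lie between two touched ones, so this works best when related constants are allocated to adjacent registers.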
Freaking awesome link! I was thinking along the exact same lines, if you're just organized with the allocation of the constants and force them to a specific register number using ": register (cx)" then you can shadow the values in source code and stop the engine from calling SetVertexShaderConstantF based on a simple memcmp.
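A minimal sketch of that shadowing idea, assuming registers are pinned with the register keyword so the C++ side knows which slot each constant lives in. ShadowedConstants and its callback are hypothetical names; the callback is assumed to wrap SetVertexShaderConstantF or SetPixelShaderConstantF.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical CPU-side shadow of the constant registers.
class ShadowedConstants {
public:
    using UploadFn =
        std::function<void(unsigned start, const float* data, unsigned vec4Count)>;

    ShadowedConstants(unsigned registerCount, UploadFn upload)
        : shadow_(registerCount * 4, 0.0f),
          valid_(registerCount, 0),
          upload_(std::move(upload)) {}

    // Uploads only if the values differ from the shadow copy.
    // Returns true when an upload actually happened.
    bool set(unsigned reg, const float* values, unsigned vec4Count) {
        assert(vec4Count > 0 && reg + vec4Count <= valid_.size());
        const std::size_t bytes = vec4Count * 4 * sizeof(float);
        float* cached = &shadow_[reg * 4];

        // Registers never written through this class have unknown contents,
        // so they must not be treated as matching.
        const bool known = std::all_of(valid_.begin() + reg,
                                       valid_.begin() + reg + vec4Count,
                                       [](char v) { return v != 0; });
        if (known && std::memcmp(cached, values, bytes) == 0) {
            return false;  // redundant write, skip the API call
        }
        std::memcpy(cached, values, bytes);
        std::fill(valid_.begin() + reg, valid_.begin() + reg + vec4Count, 1);
        upload_(reg, values, vec4Count);
        return true;
    }

    // Forget everything, e.g. after a device reset or when other code
    // writes registers behind this cache's back.
    void invalidate() { std::fill(valid_.begin(), valid_.end(), 0); }

private:
    std::vector<float> shadow_;
    std::vector<char> valid_;
    UploadFn upload_;
};
```

The cache is only correct if every constant write goes through it, which is one reason this does not combine well with the Effects framework setting registers on its own.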

I have profiled my game many times in PIX, and I see the disturbing trend of repeatedly thrashing the constant registers back and forth as well as pedantically setting all "uniform externs" before each DIP.

I didn't think about sorting by constants register usage, but sorting by vertex shader might be a step in the right direction.

Thanks for the food for thought guys!
My profiling has shown that it always helps to redundancy-check small values such as bool, float, int, up to vec4.
Matrices are better just sent without a check.
Note that this is probably going to be fairly consistent across the current generation of cards in DirectX, and is not just a hardware issue but very much a driver issue as well: the same cards, running the same scenes and passing the same matrices in OpenGL, are always better off with a redundancy check, even for larger types such as matrices.

Note also that redundancy-checking in general is a good idea, not just for uniforms, but for textures, vertex buffers, index buffers, shaders, depth-test state, etc.
In DirectX 9 you will definitely want to sort by shader first, followed by textures. This allows you to maximize redundancies, which makes it worth your time to actually do redundancy checks.
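A small sketch of that ordering, assuming each draw carries integer ids for its shader and texture. DrawItem and both helper functions are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical per-draw record; the ids stand in for real resources.
struct DrawItem {
    std::uint32_t shaderId;
    std::uint32_t textureId;
    std::uint32_t meshId;
};

// Orders by shader first, then by texture.
inline bool lessByState(const DrawItem& a, const DrawItem& b) {
    return std::tie(a.shaderId, a.textureId) < std::tie(b.shaderId, b.textureId);
}

// Sorts so that draws sharing a shader (and then a texture) are adjacent.
// stable_sort keeps the submission order within each group.
inline void sortForStateReuse(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(), lessByState);
    assert(std::is_sorted(items.begin(), items.end(), lessByState));
}

// Counts how many shader binds a redundancy-checked renderer would issue
// when walking the list in order.
inline int countShaderBinds(const std::vector<DrawItem>& items) {
    int binds = 0;
    bool haveCurrent = false;
    std::uint32_t current = 0;
    for (const DrawItem& item : items) {
        if (!haveCurrent || item.shaderId != current) {
            ++binds;
            current = item.shaderId;
            haveCurrent = true;
        }
    }
    return binds;
}
```

Sorting does not remove any state changes by itself; it places identical states next to each other so that the redundancy checks actually hit.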
I disagree with this being any kind of premature optimization. In general, this is a rule in the world of graphics, which is why tools such as gDEBugger and Xcode OpenGL ES Analysis show performance warnings when redundant states are set.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

