Cost of Switching Shaders


1. Why is switching shaders expensive?

2. Is it any faster in DX12? Is the logic different for 12 vs 11?

Also, for successive command list submissions in D3D12 with the same PSO, is there "redundant PSO filtering", or does it cost the same whether the PSO is the same or different?

-potential energy is easily made kinetic-


On the CPU side, the "root signature" is changed, which means that (on pre-D3D12 APIs) all the resource bindings must be re-sent to the GPU. The driver/runtime also might have to resubmit a bunch of pipeline state, and even validate that the PS/VS are compatible, patch them if they mismatch, or patch the VS if it mismatches the IA config... The driver might also have to do things like patch the PS if it doesn't match the current render-target format... :(

D3D12 helps here because you're aware of what's going on now -- you manage your own resource bindings and root signatures, instead of a general purpose runtime trying to guess the best way to manage them for you. It also doesn't do any validation when you bind a resource, so rebinding resources upon root-signature changes is cheaper.
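To make that concrete, here's a minimal sketch of explicit rebinding after a root-signature change on a D3D12 command list (the names rootSig, pso, cbGpuVA and srvTableStart are mine, assumed to have been created/obtained up front):

    cmd->SetGraphicsRootSignature(rootSig);                 // invalidates all prior root bindings
    cmd->SetPipelineState(pso);                             // shader/state compatibility was validated at PSO creation, not here
    cmd->SetGraphicsRootConstantBufferView(0, cbGpuVA);     // slot 0: CBV as a raw GPU virtual address
    cmd->SetGraphicsRootDescriptorTable(1, srvTableStart);  // slot 1: SRV descriptor table
    cmd->DrawInstanced(vertexCount, 1, 0, 0);

None of these calls triggers runtime validation or driver patching; the app is responsible for binding exactly what the root signature expects.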

On the GPU side, the front-end has to do a bunch of work to consume all those state/resource changes and get the GPU's cores ready to run with the new configuration. This work is generally pipelined, so the cost is hidden as long as there are no pipeline stalls... However, if the GPU's cores don't have enough work per draw-call, they might finish their work before the front-end has configured the next pipeline state, forcing them to sit idle while that happens -- i.e. a pipeline bubble is formed.

On older GPUs I used to use the rule of thumb that a shader change (or other major pipeline state change) was OK as long as every batch covered at least 400 pixels. On modern GPUs I believe you can get away with more frequent switches than that. Some GPUs can even be preparing more than one pipeline config at a time, so you need ~8 "small batches" in a row in order to actually cause a pipeline bubble.

On the CPU side, the "root signature" is changed, which means that (on pre-D3D12 APIs) all the resource bindings must be re-sent to the GPU. The driver/runtime also might have to resubmit a bunch of pipeline state, and even validate that the PS/VS are compatible, patch them if they mismatch, or patch the VS if it mismatches the IA config... The driver might also have to do things like patch the PS if it doesn't match the current render-target format... :(

Since you're describing pre-DX12 problems, I shall add that most state changes (particularly shader changes) meant the driver would delay all validation and updates (basically any actual work) until the next DrawPrimitive call, since only then does the driver have all the information it needs: the IA layout and vertex buffer bindings to patch vertex shaders, the RTT format and multisample settings to patch the pixel shader, etc.

Then it would have to internally build a cache of all the IA layout / RTT / shader combinations, and pull the patched ISA code from that cache the next time the same combination was needed.
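Purely as illustration, a toy sketch of that kind of cache (not real driver code -- CompileAndPatch and every field here are hypothetical; real drivers key on far more state):

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct PipelineKey {
        uint64_t vsHash;       // hash of vertex shader bytecode
        uint64_t psHash;       // hash of pixel shader bytecode
        uint32_t iaLayoutId;   // input-assembler layout
        uint32_t rtFormat;     // render-target format
        uint32_t sampleCount;  // multisample setting
        bool operator==(const PipelineKey&) const = default;
    };

    struct KeyHash {
        size_t operator()(const PipelineKey& k) const {
            uint64_t h = k.vsHash;
            h = h * 31 + k.psHash;
            h = h * 31 + k.iaLayoutId;
            h = h * 31 + k.rtFormat;
            h = h * 31 + k.sampleCount;
            return size_t(h);
        }
    };

    using IsaBlob = std::vector<uint8_t>;
    IsaBlob CompileAndPatch(const PipelineKey& key);  // hypothetical slow path

    std::unordered_map<PipelineKey, IsaBlob, KeyHash> g_isaCache;

    // Only at DrawPrimitive time is every field of the key known, so only
    // then can the driver look up (or build and insert) the patched ISA.
    const IsaBlob& GetOrBuildIsa(const PipelineKey& key) {
        auto it = g_isaCache.find(key);
        if (it == g_isaCache.end())
            it = g_isaCache.emplace(key, CompileAndPatch(key)).first;
        return it->second;
    }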

Mantle said screw it, and came up with Pipeline State Objects to condense all the information any GPU could possibly need to generate the ISA from shaders into one huge blob, moving the overhead from DrawPrimitive time (which happens every frame) to PSO creation time (which happens once).
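D3D12 inherited that design. A minimal sketch (assuming vsBlob/psBlob hold compiled shader bytecode, rootSig and inputLayout[] already exist, and the CD3DX12_* helpers from d3dx12.h are available) shows how all the state the driver previously had to wait for is supplied once, at creation:

    D3D12_GRAPHICS_PIPELINE_STATE_DESC desc = {};
    desc.pRootSignature        = rootSig;
    desc.VS                    = { vsBlob->GetBufferPointer(), vsBlob->GetBufferSize() };
    desc.PS                    = { psBlob->GetBufferPointer(), psBlob->GetBufferSize() };
    desc.InputLayout           = { inputLayout, _countof(inputLayout) };  // IA layout known up front
    desc.PrimitiveTopologyType = D3D12_PRIMITIVE_TOPOLOGY_TYPE_TRIANGLE;
    desc.NumRenderTargets      = 1;
    desc.RTVFormats[0]         = DXGI_FORMAT_R8G8B8A8_UNORM;              // RTT format known up front
    desc.SampleDesc.Count      = 1;                                       // multisampling known up front
    desc.RasterizerState       = CD3DX12_RASTERIZER_DESC(D3D12_DEFAULT);
    desc.BlendState            = CD3DX12_BLEND_DESC(D3D12_DEFAULT);
    desc.DepthStencilState     = CD3DX12_DEPTH_STENCIL_DESC(D3D12_DEFAULT);
    desc.SampleMask            = UINT_MAX;

    ID3D12PipelineState* pso = nullptr;
    device->CreateGraphicsPipelineState(&desc, IID_PPV_ARGS(&pso));       // the one-time cost lives here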

I believe "expensive" is all relative, it all depends on the other choices you have to achieve the aimed goal.

Measure and profile :)
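One way to put numbers on it in D3D12 is to bracket the stretch of draws you care about with timestamp queries. A sketch (device, cmd, queue, readbackBuffer and the mapped ticks[] array are assumed to exist already):

    D3D12_QUERY_HEAP_DESC qhDesc = {};
    qhDesc.Type  = D3D12_QUERY_HEAP_TYPE_TIMESTAMP;
    qhDesc.Count = 2;
    ID3D12QueryHeap* queryHeap = nullptr;
    device->CreateQueryHeap(&qhDesc, IID_PPV_ARGS(&queryHeap));

    cmd->EndQuery(queryHeap, D3D12_QUERY_TYPE_TIMESTAMP, 0);
    // ... the PSO switches and draws you want to measure ...
    cmd->EndQuery(queryHeap, D3D12_QUERY_TYPE_TIMESTAMP, 1);

    // Copy both ticks into a readback buffer; map it once the GPU is done.
    cmd->ResolveQueryData(queryHeap, D3D12_QUERY_TYPE_TIMESTAMP, 0, 2, readbackBuffer, 0);

    UINT64 freq = 0;
    queue->GetTimestampFrequency(&freq);                              // ticks per second
    double ms = double(ticks[1] - ticks[0]) * 1000.0 / double(freq);  // GPU time in milliseconds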

Crealysm game & engine development: http://www.crealysm.com

Looking for a passionate, disciplined and structured producer? PM me

Thank you for your answers.


On older GPUs I used to use the rule of thumb that a shader change (or other major pipeline state change) was OK as long as every batch covered at least 400 pixels.

Is there a similar figure for triangle counts as well?


I believe "expensive" is all relative, it all depends on the other choices you have to achieve the aimed goal.

I am obsessed with front-to-back rendering, but even so, switching shaders to accomplish that goal is most likely too expensive. So I'll likely still sort by shader first, then depth.
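For what it's worth, a common way to implement "shader first, then depth" (my sketch, not something from this thread) is a packed sort key with the shader/PSO id in the high bits so it dominates the ordering:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct DrawItem {
        uint32_t shaderId;  // or PSO index
        float    depth;     // normalized view depth, 0 = nearest
        uint64_t sortKey;
    };

    uint64_t MakeKey(uint32_t shaderId, float depth) {
        float clamped = depth < 0.f ? 0.f : (depth > 1.f ? 1.f : depth);
        uint32_t d = uint32_t(clamped * 16777215.0f);  // quantize depth to 24 bits
        return (uint64_t(shaderId) << 24) | d;         // shader id outranks depth
    }

    void SortDraws(std::vector<DrawItem>& items) {
        for (auto& item : items)
            item.sortKey = MakeKey(item.shaderId, item.depth);
        std::sort(items.begin(), items.end(),
                  [](const DrawItem& a, const DrawItem& b) { return a.sortKey < b.sortKey; });
    }

Ascending order then gives front-to-back within each shader bucket, so you keep most of the early-Z benefit without paying for extra shader switches.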

-potential energy is easily made kinetic-

On the CPU side, the "root signature" is changed, which means that (on pre-D3D12 APIs) all the resource bindings must be re-sent to the GPU. The driver/runtime also might have to resubmit a bunch of pipeline state, and even validate that the PS/VS are compatible, patch them if they mismatch, or patch the VS if it mismatches the IA config... The driver might also have to do things like patch the PS if it doesn't match the current render-target format... :(
D3D12 helps here because you're aware of what's going on now -- you manage your own resource bindings and root signatures, instead of a general purpose runtime trying to guess the best way to manage them for you. It also doesn't do any validation when you bind a resource, so rebinding resources upon root-signature changes is cheaper.

On the GPU side, the front-end has to do a bunch of work to consume all those state/resource changes and get the GPU's cores ready to run with the new configuration. This work is generally pipelined, so the cost is hidden as long as there are no pipeline stalls... However, if the GPU's cores don't have enough work per draw-call, they might finish their work before the front-end has configured the next pipeline state, forcing them to sit idle while that happens -- i.e. a pipeline bubble is formed.
On older GPUs I used to use the rule of thumb that a shader change (or other major pipeline state change) was OK as long as every batch covered at least 400 pixels. On modern GPUs I believe you can get away with more frequent switches than that. Some GPUs can even be preparing more than one pipeline config at a time, so you need ~8 "small batches" in a row in order to actually cause a pipeline bubble.

I wonder... if you were to bundle all those calls into a D3D11 command list, would (can?) the driver perform most (all?) of those transformations beforehand, when the command list is created, and so get in D3D11 some of the optimizations that D3D12 gives? Of course you'd need to either create the command list on another thread, or reuse the command list, to get any advantage.

I wonder... if you were to bundle all those calls into a D3D11 command list, would (can?) the driver perform most (all?) of those transformations beforehand, when the command list is created, and so get in D3D11 some of the optimizations that D3D12 gives? Of course you'd need to either create the command list on another thread, or reuse the command list, to get any advantage.

I don't think so. AFAIK, D3D11's cmd list recording is all done in the user-mode library, and doesn't actually call into the driver until you submit the list.

I don't think so. AFAIK, D3D11's cmd list recording is all done in the user-mode library, and doesn't actually call into the driver until you submit the list.


It does state in the documentation: "Pre-record a command list before you need to render it (for example, while a level is loading) and efficiently play it back later in your scene. This optimization works well when you need to render something often." I've also seen it stated (though I don't remember exactly where) that the driver can perform some optimizations on the command list, and that the user-mode library is only used when the driver doesn't support command lists itself.
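For reference, the record-once / replay-often pattern the docs describe looks roughly like this (a sketch; device, immediateCtx and recordScene() are assumed names):

    ID3D11DeviceContext* deferredCtx = nullptr;
    device->CreateDeferredContext(0, &deferredCtx);

    recordScene(deferredCtx);  // issue state changes + draws; they are recorded, not executed

    ID3D11CommandList* cmdList = nullptr;
    deferredCtx->FinishCommandList(FALSE, &cmdList);   // FALSE: don't carry deferred state over

    // Every frame thereafter:
    immediateCtx->ExecuteCommandList(cmdList, TRUE);   // TRUE: restore context state afterwards

Whether the driver does any of the patching at FinishCommandList time, rather than at ExecuteCommandList time, is exactly the open question here; with driver-level command list support it at least has the opportunity.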
