draw call order and binding costs - dx9 shaders

8 comments, last by Hodgman 7 years, 3 months ago

I'm beginning to add shader based capabilities to my wrapper API for dx9.0c.

at first, i can get away with shaders for special effects (instancing, swaying grass, etc), and fixed function for the things it's capable of (Gouraud and Phong, alpha test and alpha blend, 2 stage texture blending, etc).

but if i eventually go to 100% shader based, this would likely call for a re-design of the render queue.

in fixed function, textures have the highest binding cost. so my render queue sorts on texture. to date that has been fast enough that i don't even bother with sorting on mesh, distance, etc. but that's just with placeholder graphics.

so looking forward to a time when i'm 100% shader based, what are the binding costs i'm looking at? where does setting vertex and pixel shaders and constants fall in the list of binding texture, input streams, shader constants, input formats, pixel shader, vertex shader, etc ? materials will be replaced by pixel shaders (right?) - so no binding costs there.

i'll want single purpose pre-compiled shaders, and i'll want to treat them as a shared graphics resource, in a memory pool, with state management, right? perhaps with shared memory pools of constants and formats as well?

Norm Barrows

Rockland Software Productions

"Building PC games since 1989"

rocklandsoftware.net

PLAY CAVEMAN NOW!

http://rocklandsoftware.net/beta.php


D3D9 effectively has to validate the pipeline state before anything affecting the drawing is changed. Shaders are relatively expensive to validate, as the driver does internal microcode compilation and optimization before running the shader code, and checks that the input (and output) streams make sense against the shaders.

A good practice would be to sort the state changes by shaders, textures and vertex input streams.

Some D3D9 cards are limited on the available shader instruction count and performance, so you need to find a balance between the actual drawing performance and the batching performance. On more modern cards (D3D10 and up) it is much more feasible to define so-called "über-shaders" that don't need to be changed very often.

Niko Suni

Internally, changing render states (alpha test / etc) and changing render-targets can be anything from writing a few bytes to internally patching the shader (note that even FFP is using shaders internally - there is no FFP hardware any more, there are just shader cores).

Changing shaders requires quite a bit of work, and may even internally rebind your resources.

Binding resources causes a lot of internal reference counting and fencing to occur.

Binding constants is pretty cheap.

On the GPU side, changing render targets usually requires a cache flush.

The GPU will internally try to run multiple draw calls at once, but changing the shader program or some pipeline states can insert a kind of fence that prevents this.
Changing resources (texture bindings) can also prevent this.
You want to submit draws that use the same textures next to each other in order to leverage the texture cache - if data is still in the cache after the first draw, the second draw may get cache hits.

Modern GPUs have early depth testing (before pixel shading), so the GPU prefers front to back opaque drawing.

Overall, I'd sort by render target, then opaque/alpha test/alpha blend, then coarse depth sorting (very rough front to back - maybe a dozen to a few hundred buckets), then shader and render state, then by resources, then by constants.
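One way to get that ordering cheaply is to pack the fields into a single integer key, most significant field first, and sort on the key. The following is only a sketch: the `DrawItem` fields, the id types and the bit widths are assumptions made up for illustration, not anything D3D9 defines.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical blend classes, numbered so opaque draws sort first.
enum class BlendClass : std::uint8_t { Opaque = 0, AlphaTest = 1, AlphaBlend = 2 };

// Hypothetical per-draw record; every id here is an application-side index.
struct DrawItem {
    std::uint8_t  renderTarget;  // must fit in 6 bits in this layout
    BlendClass    blendClass;    // 2 bits
    std::uint8_t  depthBucket;   // coarse front-to-back bucket, 0 = nearest
    std::uint16_t shaderId;      // shader pair plus render state
    std::uint16_t resourceId;    // textures / vertex streams
    std::uint16_t constantsId;   // constant block
};

// Layout (high to low): target:6 | blend:2 | depth:8 | shader:16 | resources:16 | constants:16
inline std::uint64_t makeSortKey(const DrawItem& d) {
    assert(d.renderTarget < 64);  // only 6 bits are reserved for the target
    return (static_cast<std::uint64_t>(d.renderTarget) << 58) |
           (static_cast<std::uint64_t>(d.blendClass)   << 56) |
           (static_cast<std::uint64_t>(d.depthBucket)  << 48) |
           (static_cast<std::uint64_t>(d.shaderId)     << 32) |
           (static_cast<std::uint64_t>(d.resourceId)   << 16) |
            static_cast<std::uint64_t>(d.constantsId);
}

// Sorts the queue in ascending key order.
inline void sortQueue(std::vector<DrawItem>& queue) {
    std::sort(queue.begin(), queue.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return makeSortKey(a) < makeSortKey(b);
              });
}
```

Alpha-blended draws usually want back-to-front order instead, which this sketch does not handle; one option is to invert the depth bucket for that class when the key is built.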

Overall, I'd sort by render target, then opaque/alpha test/alpha blend, then coarse depth sorting (very rough front to back - maybe a dozen to a few hundred buckets), then shader and render state, then by resources, then by constants.

Is this specific to DX9? Most people seem to suggest to sort by shader before depth, at least from what I've seen.

-potential energy is easily made kinetic-

sort the state changes by shaders, textures and vertex input streams.

vertex and pixel shader binding costs are similar?

i suspected shader binding would be expensive.

and conditional branching in uber shaders can be less costly than two smaller single effect shaders?

so i'd want to do something like bind the standard transform vertex shader, then do passes for each pixel shader effect, such as mip mapped, alpha test, alpha blend, 2 stage tex blend, and non-mip mapped (if i even use that) ?

and on each pixel shader pass, i draw in texture, mesh, near to far order, right ?

or write one or more uber shaders that combine two or more passes?

On more modern cards (D3D10 and up) it is much more feasible to define so-called "über-shaders" that don't need to be changed very often.

while i'm using dx9. odds are the game will require at least a dx10 capable card. but i'm still limited to version 3 shader code and the limitations on number of constants.

I'd sort by render target,

fortunately, i never change render target. : )

then opaque/alpha test/alpha blend,

as one uber pixel shader, or three separate shaders? one uber, right? the tests are similar. are the inputs the same? i forget, is there alpha data in an opaque texture? r8g8b8a8, yeah i guess there is, so inputs for all three effects would include alpha, so same API (inputs/params) for all three effects in one uber shader? what is that? x y z from the vertex shader, and r g b and a from the texture? oh - and light direction and intensity as constants... god it's been forever since i did graphics code!

it's looking like for something like caveman, i'd have a standard transform vertex shader, and a wind effects vertex shader. and i'd have opaque, alphatest, alphablend, and 2 stage tex blend as pixel shaders. just six in all?

what about mip-mapped vs non-mip-mapped? different pixel shader code for each version, right? i use both mip-mapped and non-mip-mapped textures.

vertex and pixel shader binding costs are similar?


Yes, in that changing either costs.
The GPU is not a single program state machine; when work is dispatched it will be a bundle of commands sent together which will likely contain both vertex and pixel shaders to use to dispatch the work - there is a reason that DX12/Vulkan encode a lot of stuff into a single structure after all.

On the driver side changing state is initially cheap; the cost comes when the draw is kicked and the driver has to figure out wtf has changed via hashing etc and then build the commands to send. Chances are that changing both costs about the same as changing one in that case.

Systems I've worked on bundle vertex and pixel shaders together, using hashing/names to reduce the API object count at run time (for example, if two materials use the same vertex shader then only one is created, but both reference it) - however a material is considered a discrete thing so if you have one object which uses Foo and another which uses Bar then even if they share a shader they are different. Now, the graphics API itself (talking DX9/DX11 here) will maintain a copy of the state so that rebinding of shaders doesn't happen (so if the materials share any shaders you won't redundantly rebind) - but a material itself is a combination of shaders, textures and constants arranged into passes etc as previously mentioned.
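The same redundant-bind filtering can also be done on the application side, so that a sorted queue only touches the device when something actually changes. A minimal sketch, assuming a made-up `FakeDevice` that just counts calls; in a real wrapper those two methods would forward to the D3D9 device, and the integer ids are hypothetical handles:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the real device: counts the binds that actually reach it.
struct FakeDevice {
    int shaderBinds = 0;
    int textureBinds = 0;
    void setShader(std::uint32_t)  { ++shaderBinds; }
    void setTexture(std::uint32_t) { ++textureBinds; }
};

// Remembers the last bound ids and drops binds that would change nothing.
class StateCache {
public:
    explicit StateCache(FakeDevice& dev) : dev_(dev) {}

    void bindShader(std::uint32_t id) {
        assert(id != kNone);         // kNone is reserved as "nothing bound"
        if (id == shader_) return;   // already bound, skip the device call
        dev_.setShader(id);
        shader_ = id;
    }

    void bindTexture(std::uint32_t id) {
        assert(id != kNone);
        if (id == texture_) return;
        dev_.setTexture(id);
        texture_ = id;
    }

private:
    static constexpr std::uint32_t kNone = 0xFFFFFFFFu;
    FakeDevice&   dev_;
    std::uint32_t shader_  = kNone;
    std::uint32_t texture_ = kNone;
};
```

With a queue sorted by shader and then texture, consecutive draws tend to hit the early-out, which is where the sort pays off on the CPU side.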

and conditional branching in uber shaders can be less costly than two smaller single effect shaders?


Maybe.
It depends somewhat on the hardware, somewhat on the driver and somewhat on how it is used; if you have zero to little divergent flow (some of your pixels go one way, some the other) then it can be a win.
However the flip side of this is that Ubershaders have to assume you will take both paths, which brings up the demon of 'register pressure' - a GPU only has so many registers it can allocate to running threads; the more your shaders use the fewer wavefronts/warps the hardware can keep in flight and the worse your performance can be.

Branching causes a problem because let's say the GPU has 40 registers to keep work in flight (real hardware has many more but run with it) - if your shader takes 4 registers to execute then we can run 10 instances at once. If it takes 5 then we are down to 8, 6 leads to 6 and so on (always round down). Now, let's say that your shader has a branch with two paths to it - one requires 4 registers, the other requires 3 - the compiler will produce code which requires 7 registers to run which are statically allocated by the GPU when execution begins for that shader - now you have at most 5 instances running (40/7 = 5.7, and we round down) for any draw calls using that shader. If, however, you had two shaders then you would get 10 and 13 instances depending on which shader you took.

So while you might write an ubershader, it can often be much better to compile two versions of it and select at runtime to get maximal throughput on the device.
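The arithmetic in that example is just integer division. As a toy sketch, using the same made-up 40-register budget (real hardware allocates registers in a more complicated way, so treat the function name and the model as illustrative only):

```cpp
#include <cassert>

// Number of shader instances that fit in a fixed register budget.
// Integer division truncates, which is the "always round down" rule.
inline int instancesInFlight(int totalRegisters, int registersPerInstance) {
    assert(totalRegisters >= 0 && registersPerInstance > 0);
    return totalRegisters / registersPerInstance;
}

// With the 40-register budget from the example:
//   instancesInFlight(40, 4) == 10
//   instancesInFlight(40, 5) == 8
//   instancesInFlight(40, 6) == 6
//   instancesInFlight(40, 7) == 5   (branchy shader needing 4 + 3 registers)
//   instancesInFlight(40, 3) == 13  (the 3-register path as its own shader)
```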
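Selecting a precompiled version at runtime can be as simple as a lookup keyed on a feature bitmask. This is a sketch under assumptions: the feature flags, the `VariantTable` class and the use of a string as a stand-in for a compiled shader handle are all invented for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical feature bits; each combination is compiled as its own shader.
enum Feature : std::uint32_t {
    kAlphaTest     = 1u << 0,
    kSecondTexture = 1u << 1,
    kWind          = 1u << 2,
};

// Maps a feature mask to a precompiled variant. The string stands in for
// whatever handle the real compiled shader object would have.
class VariantTable {
public:
    void add(std::uint32_t features, std::string shaderName) {
        variants_[features] = std::move(shaderName);
    }

    // Returns nullptr when no variant was compiled for this exact mask.
    const std::string* find(std::uint32_t features) const {
        auto it = variants_.find(features);
        return it == variants_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::uint32_t, std::string> variants_;
};
```

At draw-submission time the material's feature mask picks the variant, so each variant only pays for the registers its own path needs.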

then opaque/alpha test/alpha blend,


as one uber pixel shader, or three separate shaders? one uber, right? the tests are similar. are the inputs the same? i forget, is there alpha data in an opaque texture? r8g8b8a8, yeah i guess there is, so inputs for all three effects would include alpha, so same API (inputs/params) for all three effects in one uber shader? what is that? x y z from the vertex shader, and r g b and a from the texture? oh - and light direction and intensity as constants... god it's been forever since i did graphics code!


Those are not shaders; those are device states.
Shaders might interact but that's a separate sort.

I also feel your concepts of 'pass' and 'ubershader' are wrong here;
- A 'pass' is a distinct group of draw calls which lay down some information such as a depth pass, or a colour pass - you wouldn't do multiple passes in a single draw call/shader invoke.
- Ubershaders are a combination of potential shader paths; previously you might write a shader which does a single texture fetch, then another which does two, or three - then at draw time depending on what you are doing you'd select the correct shader to run. With an Ubershader you write a shader which does all of the above and then use variables, either constants or sourced from a buffer, to decide what you are going to do, much like a normal program. Passes etc don't factor into it.

what about mip-mapped vs non-mip-mapped? different pixel shader code for each version, right? i use both mipmaped and non-mipmapped textures.


No, those are texture states and have no impact on the kinds of shaders you'd be writing at this basic level.

Honestly, I think you need to slow down and go and read some stuff on APIs from the past 15 years to catch up your knowledge a bit as it'll make more sense to you and you are less likely to get lost in terms and concepts you don't understand.

L. Spiro talked about it back in 2011.

The thing is, I don't know how out of date it is. How DX9 abstracts the HW is very distant from how modern GPUs work nowadays.

The question is "are you supporting DX9 to support Win XP and run on modern GPUs? or are you supporting DX9 to support old cards while you use DX11+ to run on modern GPUs?"

Because the answer to the sorting problem can vary based on how you answer this question. Older cards have a higher penalty for swapping textures than modern cards do, for example. In newer cards, changing the vertex format means the driver will internally swap in a different shader.

Everyone agrees RenderTarget comes first, and that changing uniforms is cheap. But are textures more expensive than swapping shaders or changing vertex formats? Mmm... good question.

On top of this, there's a lot of CPU-side validation going on.

The best answer I can give is "just try it". Run your system on the target GPUs you want to support and swap different rendering orders to see which one performs best on your general case.

But are textures more expensive than swapping shaders or changing vertex formats? Mmm... good question.
There's also the question of CPU cost VS GPU cost.

e.g. perhaps swapping textures takes less CPU time than swapping shaders... but perhaps swapping shaders has a bigger impact on the GPU frametime!

Is this specific to DX9? Most people seem to suggest to sort by shader before depth, at least from what I've seen.
It depends how much you care about CPU time VS GPU time when performing these optimizations.

Sorting by coarse front-to-back depth doesn't help your CPU time at all -- it probably hurts it by requiring more state changes overall... but can help on the GPU by getting more overdraw to be rejected by early-Z testing. If your sorting is too fine grained, then you'll start to hurt the GPU by excessive shader swapping / state changes, which means the GPU may have to stall between successive draws in order to reconfigure itself. If your sorting is too coarse, then you don't actually stop any overdraw from occurring.

e.g. if you sort by shader, and then by depth within each shader-grouping, it's possible that you don't stop any overdraw. Perhaps shader-A is used on the background and shader-B is used on the foreground. Sorting by shader means you draw A first and then B, which means you're drawing background before foreground. Doing a very coarse depth sort before sorting by shader is a trade-off to make sure that you get decent value out of early-Z testing.
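The background/foreground case can be reproduced with a tiny sketch. Everything here is illustrative: the `Draw` struct, the linear bucketing function and the bucket count are assumptions, not a recommendation for specific values.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical draw record: a shader label and a view-space depth.
struct Draw {
    char  shader;
    float depth;
};

// Quantizes depth into [0, bucketCount) linearly between nearZ and farZ.
inline int depthBucket(float depth, float nearZ, float farZ, int bucketCount) {
    assert(bucketCount > 0 && farZ > nearZ);
    float t = (depth - nearZ) / (farZ - nearZ);
    t = std::min(std::max(t, 0.0f), 1.0f);        // clamp out-of-range depths
    int bucket = static_cast<int>(t * bucketCount);
    return std::min(bucket, bucketCount - 1);     // t == 1 lands in the last bucket
}

// Coarse depth first, shader second: near buckets are drawn first.
inline void sortBucketThenShader(std::vector<Draw>& draws,
                                 float nearZ, float farZ, int bucketCount) {
    std::stable_sort(draws.begin(), draws.end(),
        [=](const Draw& a, const Draw& b) {
            int ba = depthBucket(a.depth, nearZ, farZ, bucketCount);
            int bb = depthBucket(b.depth, nearZ, farZ, bucketCount);
            if (ba != bb) return ba < bb;
            return a.shader < b.shader;
        });
}

// Shader first, exact depth second: fewer switches, but order depends on labels.
inline void sortShaderThenDepth(std::vector<Draw>& draws) {
    std::stable_sort(draws.begin(), draws.end(),
        [](const Draw& a, const Draw& b) {
            if (a.shader != b.shader) return a.shader < b.shader;
            return a.depth < b.depth;
        });
}
```

With shader 'A' on far geometry and shader 'B' on near geometry, the shader-first sort draws all of 'A' before any of 'B', while the bucket-first sort draws the near 'B' geometry first so early-Z can reject the hidden background pixels.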
