[background=#fafbfc]Also, most GPUs implement "Hi-Z" or some similar marketing buzzword, which conceptually is like a tiled min-max depth buffer. e.g. for every 8x8 pixel block, there is an auxiliary buffer that stores the smallest and largest depth value inside that region. The fixed-function rasterizer can make use of this information to sometimes very quickly accept (all pixels pass depth test without actually doing depth testing) or reject (all pixels fail depth test and are discarded immediately) whole blocks of pixels.[/background]
I might be wrong here, but isn't Hi-Z about culling whole triangles before they are even passed to the rasterizer? In other words, based on the depth values of the three vertices of a triangle?
Both are correct. Hi-Z originally stood for Hierarchical Z, in which the min-max Z values are stored at different granularities: the whole image, then each of its 4 quadrants, then 16 regions, and so on until we get to blocks of 8x8 pixels.
Therefore it is possible both to reject the whole triangle up front and to reject it at the 8x8-block level.
This is GPU-specific. Whether a GPU keeps the min-max Z values only at the 8x8 level, or implements the hierarchy at some levels or at all of them, is up to the IHV.
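To make the idea concrete, here is a minimal sketch (HLSL-style and purely conceptual; nothing the fixed-function hardware literally runs, and all names are made up) of the coarse decision a per-tile min/max depth buffer enables, assuming a LESS depth comparison:

[code]
// Conceptual sketch only: how per-tile min/max depth lets whole 8x8 blocks
// be accepted or rejected without per-pixel depth testing.
// Assumes a LESS depth test; CoarseDepthTest and all parameters are made up.
// Return values: 1 = trivially accept, -1 = trivially reject, 0 = per-pixel test needed.
int CoarseDepthTest(float triMinZ, float triMaxZ,   // depth range of the triangle over this tile
                    float tileMinZ, float tileMaxZ) // stored min/max depth of the 8x8 tile
{
    if (triMaxZ < tileMinZ) return  1; // every covered pixel passes: skip per-pixel tests
    if (triMinZ > tileMaxZ) return -1; // every covered pixel fails: discard the block immediately
    return 0;                          // ranges overlap: fall back to the fine-grained test
}
[/code]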
- Fixed-function alpha testing is disabled.
I shall add that the presence of discard or clip has the same effect (shader-based alpha testing).
Also related to the UAVs, D3D11 lets you specify the [earlydepthstencil] attribute to force early depth/stencil optimizations to be enabled even if you have writes to UAVs.
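As a rough sketch of what that looks like in HLSL (the UAV slot and names here are placeholders, not taken from the article):

[code]
// Sketch: forcing early depth/stencil even though the shader writes to a UAV.
// Without [earlydepthstencil], the UAV write normally forces the depth test to run late.
RWStructuredBuffer<uint> gVisibleCounter : register(u1); // placeholder UAV name/slot

[earlydepthstencil]
float4 PSMain(float4 pos : SV_Position) : SV_Target
{
    uint unused;
    // Only fragments that survived the early depth/stencil test reach this point,
    // so the counter only counts visible pixels.
    InterlockedAdd(gVisibleCounter[0], 1, unused);
    return float4(1.0f, 0.0f, 0.0f, 1.0f);
}
[/code]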
Yes. I shall add the GLSL equivalent is the early_fragment_tests layout qualifier.

What I don't understand is this: how does the pipeline judge whether or not to waste resources shading a pixel if the operations with the depth buffer occur in the output merger stage?
The pipeline doesn't judge. Before [earlydepthstencil] was introduced, D3D11
mandated that the Z check happen after pixel shader execution. But this model split into "stages" is just a set of rules. There is no output merger chip in modern GPUs (maybe there was one in a particular GPU model).
GPUs just have to pretend they follow the rules: the only thing that matters is that the final result must look and act the same as if the rasterization had happened in order. Executing the Z test before the pixel shader is often a harmless optimization.
Features like the already-mentioned depth output obviously break this assumption (because the depth value isn't known until the pixel shader has executed), while others (like alpha testing) do not break the assumption but may not play nicely with some internal algorithm or optimization done in the GPU.
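For example, a fragment like the following (a made-up HLSL sketch, not from the article) makes early rejection impossible in the general case, because the value to test against the depth buffer only exists after the shader has run:

[code]
// Sketch: writing SV_Depth from the pixel shader defeats early-Z, because the
// depth to be tested is only known once the shader has executed.
float4 PSDepthOut(float4 pos : SV_Position,
                  out float depth : SV_Depth) : SV_Target
{
    depth = saturate(pos.z + 0.001f); // arbitrary example value, not a real technique
    return float4(1.0f, 1.0f, 1.0f, 1.0f);
}
[/code]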
Intuitively I would have thought that something should occur in advance of shading whereby the blend state and the depth buffer are both consulted to determine whether or not to execute the shader.
Because there has been no "pixel shader chip" or "output merger chip" since the GeForce 8, stages are just imaginary machinations to make rendering work reliably with a set of rules. The stages don't communicate with each other; rather, the driver looks at all the commands you've been submitting to the command queue, analyzes the current state, decides the best course of action (which is hardware- and model-specific), and submits the work to the GPU.
This is why D3D11 is so "slow" compared to newer APIs like D3D12. Drivers have to spend a lot of time delaying submission until they get the big picture, then analyzing the commands over and over again every frame. For example, in D3D12 rendering is done via PSOs (Pipeline State Objects), where all the information needed (all the shaders being used, alpha testing, UAVs set, etc.) is known at PSO creation time, so the driver can analyze it and set the best course of action once (at creation time) rather than every frame by delaying and using heuristics.