[background=#fafbfc]Also, most GPUs implement "Hi-Z" or some similar marketing buzzword, which conceptually is like a tiled min-max depth buffer. e.g. for every 8x8 pixel block, there is an auxiliary buffer that stores the smallest and largest depth value inside that region. The fixed-function rasterizer can make use of this information to sometimes very quickly accept (all pixels pass depth test without actually doing depth testing) or reject (all pixels fail depth test and are discarded immediately) whole blocks of pixels.[/background]
I might be wrong here, but isn't Hi-Z about culling whole triangles before they are even passed to the rasterizer? In other words, based on the depth values of the three vertices of a triangle?
Both are correct. Hi-Z originally stood for Hierarchical Z, in which the min-max Z values are stored at different granularities: the whole image, then each of its 4 quadrants, then 16 regions, and so on until we get to blocks of 8x8 pixels.
Therefore it is possible both to reject the whole triangle up front and to reject it at the 8x8-block level.
This is GPU-specific. Whether a GPU keeps the min-max Z values only at the 8x8 level, or implements the hierarchy at some levels or at all of them, is up to the IHV.
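To make the idea concrete, here is a minimal sketch (HLSL-style and purely conceptual; nothing the fixed-function hardware literally runs, and all names are made up) of the coarse decision a per-tile min/max depth buffer enables, assuming a LESS depth comparison:

[code]
// Conceptual sketch only: how per-tile min/max depth lets whole 8x8 blocks
// be accepted or rejected without per-pixel depth testing.
// Assumes a LESS depth test; CoarseDepthTest and all parameters are made up.
// Return values: 1 = trivially accept, -1 = trivially reject, 0 = per-pixel test needed.
int CoarseDepthTest(float triMinZ, float triMaxZ,   // depth range of the triangle over this tile
                    float tileMinZ, float tileMaxZ) // stored min/max depth of the 8x8 tile
{
    if (triMaxZ < tileMinZ) return  1; // every covered pixel passes: skip per-pixel tests
    if (triMinZ > tileMaxZ) return -1; // every covered pixel fails: discard the block immediately
    return 0;                          // ranges overlap: fall back to the fine-grained test
}
[/code]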
- Fixed-function alpha testing is disabled.
I shall add that the presence of discard or clip has the same effect (shader-based alpha testing).
Also related to the UAVs, D3D11 lets you specify the [earlydepthstencil] attribute to force early depth/stencil optimizations to be enabled even if you have writes to UAVs.
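As a rough sketch of what that looks like in HLSL (the UAV slot and names here are placeholders, not taken from the article):

[code]
// Sketch: forcing early depth/stencil even though the shader writes to a UAV.
// Without [earlydepthstencil], the UAV write normally forces the depth test to run late.
RWStructuredBuffer<uint> gVisibleCounter : register(u1); // placeholder UAV name/slot

[earlydepthstencil]
float4 PSMain(float4 pos : SV_Position) : SV_Target
{
    uint unused;
    // Only fragments that survived the early depth/stencil test reach this point,
    // so the counter only counts visible pixels.
    InterlockedAdd(gVisibleCounter[0], 1, unused);
    return float4(1.0f, 0.0f, 0.0f, 1.0f);
}
[/code]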
Yes. I shall add the GLSL equivalent is the early_fragment_tests layout qualifier.

What I don't understand is this: how does the pipeline judge whether or not to waste resources shading a pixel if the operations with the depth buffer occur in the output merger stage?
The pipeline doesn't judge. Before [earlydepthstencil] was introduced, D3D11
mandated that the Z check happen after pixel shader execution. But this model split into "stages" is just a set of rules. There is no output merger chip in modern GPUs (maybe there was one in a particular GPU model).
GPUs just have to pretend they follow the rules: the only thing that matters is that the final result must look and act the same as if the rasterization had happened in order. Executing the Z test before the pixel shader is often a harmless optimization.
Features like the already-mentioned depth output obviously break this assumption (because the depth value isn't known until the pixel shader has executed), while others (like alpha testing) do not break the assumption but may not play nicely with some internal algorithm or optimization done in the GPU.
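For example, a fragment like the following (a made-up HLSL sketch, not from the article) makes early rejection impossible in the general case, because the value to test against the depth buffer only exists after the shader has run:

[code]
// Sketch: writing SV_Depth from the pixel shader defeats early-Z, because the
// depth to be tested is only known once the shader has executed.
float4 PSDepthOut(float4 pos : SV_Position,
                  out float depth : SV_Depth) : SV_Target
{
    depth = saturate(pos.z + 0.001f); // arbitrary example value, not a real technique
    return float4(1.0f, 1.0f, 1.0f, 1.0f);
}
[/code]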
Intuitively I would have thought that something should occur in advance of shading whereby the blend state and the depth buffer are both consulted to determine whether or not to execute the shader.
Because there has been no "pixel shader chip" or "output merger chip" since the GeForce 8, stages are just imaginary machinations to make rendering work reliably with a set of rules. The stages don't communicate with each other; rather, the driver looks at all the commands you've been submitting to the command queue, analyzes the current state, decides the best course of action (which is hardware- and model-specific), and submits the work to the GPU.
This is why D3D11 is so "slow" compared to newer APIs like D3D12. Drivers have to spend a lot of time delaying submission until they get the big picture, then analyzing the commands over and over again every frame. For example, in D3D12 rendering is done via PSOs (Pipeline State Objects), where all the information needed (all the shaders being used, alpha testing, UAVs set, etc.) is known at PSO creation time, so the driver can analyze it and set the best course of action once (at creation time) rather than every frame by delaying and using heuristics.