lipsryme

Questions about the render pipeline/gpu optimization


I've got 2 specific questions that I'd love to get answered or hear your opinion on.

 

1. The depth test happens "after" pixel shading, but I hear people saying modern GPUs do some kind of depth test "before" that. Does anyone have more information on what that is and what it's doing? I know about the early-z pass, which is faster because you disable color writes and skip the pixel shader. Might that be it? If so, I'd love to hear more details about that process.
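For reference, a rough sketch of such a z-pre-pass in D3D11 (the state objects, views and DrawOpaqueGeometry() are placeholders you would create elsewhere; this only illustrates the idea, it is not a drop-in implementation):

    #include <d3d11.h>

    void DrawOpaqueGeometry(ID3D11DeviceContext* ctx); // placeholder: your scene submission

    // Hypothetical depth pre-pass followed by the shading pass.
    void RenderWithDepthPrePass(ID3D11DeviceContext* ctx,
                                ID3D11DepthStencilView* dsv,
                                ID3D11RenderTargetView* backBufferRTV,
                                ID3D11DepthStencilState* lessWithWrite, // LESS, depth writes on
                                ID3D11DepthStencilState* equalNoWrite,  // EQUAL, depth writes off
                                ID3D11PixelShader* scenePS)
    {
        // Pass 1: depth only. No color target and no pixel shader bound --
        // the rasterizer still writes depth, which is what makes this pass cheap.
        ID3D11RenderTargetView* nullRTV = nullptr;
        ctx->OMSetRenderTargets(1, &nullRTV, dsv);
        ctx->OMSetDepthStencilState(lessWithWrite, 0);
        ctx->PSSetShader(nullptr, nullptr, 0);
        DrawOpaqueGeometry(ctx);

        // Pass 2: full shading. With an EQUAL depth test and depth writes off,
        // only the visible surface at each pixel runs the expensive pixel shader.
        ctx->OMSetRenderTargets(1, &backBufferRTV, dsv);
        ctx->OMSetDepthStencilState(equalNoWrite, 0);
        ctx->PSSetShader(scenePS, nullptr, 0);
        DrawOpaqueGeometry(ctx);
    }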

 

2. Having some knowledge of modern GPU architecture, I understand why dynamic branching can be really bad. What I don't understand is how this gets better with more recent hardware. What exactly makes the branching situation (every thread having to wait for the other threads to finish before continuing) any better? Is it just that the amount of distributed workload increases, so there's more latency hiding?

Edited by lipsryme


The magic term you are looking for is Hi-Z or early-Z (not to be confused with a z-pre-pass). The idea works as follows: whenever you create and clear a depth buffer, the GPU also allocates an additional low-resolution buffer. Every pixel in this smaller depth buffer corresponds to a small square tile of the real depth buffer. Whenever you render something into the depth buffer, this low-resolution buffer is also updated so that it always contains the maximum depth of all of its "sub-pixels" in the real depth buffer.

 

Now, if you render something with the depth test enabled and the pixel shader does not override the default depth value (this is checked when compiling the shader), the hardware can compute the minimum depth the polygon will have in each of the tiles it touches. That value is compared against the value stored in the small depth buffer to check, with a single comparison, whether all the pixels of a tile are occluded. If they are, all those pixels are discarded without even starting execution of their pixel shaders.

 

If the comparison with the small depth buffer cannot guarantee that all pixels are occluded, or if the pixel shader writes/changes the pixels' depth values, or if you did anything else to invalidate the small depth buffer (switching the depth test mode, rendering with a shader that modifies the depth values, ...), then the pixel shader is executed for all the pixels in the tile and the depth test is performed afterwards, per pixel, against the real depth buffer.
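To make the idea concrete, here is a small conceptual model in plain C++ (not actual driver or hardware code; the 8x8 tile size, the clear value of 1.0 and a LESS depth comparison are assumptions for illustration):

    #include <algorithm>
    #include <vector>

    // Toy model of a depth buffer with a coarse Hi-Z buffer attached.
    struct HiZDepthBuffer
    {
        static constexpr int kTile = 8;        // assumed tile size

        int width, height, tilesX, tilesY;
        std::vector<float> depth;              // full-resolution depth values
        std::vector<float> coarseMaxDepth;     // maximum depth per kTile x kTile tile

        HiZDepthBuffer(int w, int h)
            : width(w), height(h),
              tilesX((w + kTile - 1) / kTile), tilesY((h + kTile - 1) / kTile),
              depth(w * h, 1.0f),                        // cleared to the far plane
              coarseMaxDepth(tilesX * tilesY, 1.0f) {}

        // Store one pixel's depth and keep the coarse buffer conservative by
        // recomputing that tile's maximum (real hardware does this incrementally).
        void WriteDepth(int px, int py, float d)
        {
            depth[py * width + px] = d;
            const int tx = px / kTile, ty = py / kTile;
            float tileMax = 0.0f;
            for (int y = ty * kTile; y < std::min((ty + 1) * kTile, height); ++y)
                for (int x = tx * kTile; x < std::min((tx + 1) * kTile, width); ++x)
                    tileMax = std::max(tileMax, depth[y * width + x]);
            coarseMaxDepth[ty * tilesX + tx] = tileMax;
        }

        // Coarse test for a LESS comparison: if the smallest depth the polygon can
        // have inside this tile is still behind the farthest pixel already stored,
        // every covered pixel would fail the depth test, so the whole tile can be
        // rejected before any pixel shader is launched.
        bool TileFullyOccluded(int tx, int ty, float polyMinDepthInTile) const
        {
            return polyMinDepthInTile >= coarseMaxDepth[ty * tilesX + tx];
        }
    };

When a new polygon comes in, the hardware computes its conservative minimum depth for each covered tile and does the equivalent of TileFullyOccluded once per tile, which is the single comparison mentioned above.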


Thanks guys for the details.

 

 

2. Coherency requirements on early hardware were pretty terrible. On modern hardware your coherency is now a single warp or wavefront, which is a lot easier to work with. Hardware has also just gotten faster at executing branch instructions, so if there's no divergence then you don't really pay much of a cost for having the branch in your code.

 

Can you elaborate on that some more? I'm not that familiar with the terminology...

Edited by lipsryme


"Coherency" describes how often nearby groups of threads take the same branch. "Divergence" is when you have different nearby threads take different branches, and it results in the hardware executing both sides of the branch. GPU's have always had some number of threads where all threads in that group needed to take the same branch in order to avoid divergence. On modern GPU's that number is the warp or wavefront size, which means 32 threads on Nvidia and 64 threads on AMD. On older GPU's these numbers were much higher, which made it difficult to use branching effectively since it was much more likely that you would have divergence.


What alternatives are there to discard/alpha test/alpha to coverage? I am getting a HUGE slowdown (57 fps to 35 fps) rendering a large number of plants in DX11 (instanced) when I clip() based on alpha (as opposed to rendering all the leaves incorrectly). I suspect it is from Hi-Z optimizations being disabled...

Do you render them last, to keep Hi-Z alive as long as possible? You can also try spending a couple more vertices to make the polygons fit more tightly around the opaque area of the plants.
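For illustration, the draw order being suggested might look roughly like this (DrawOpaqueScene and DrawFoliageInstanced are placeholders for your own submission code):

    #include <d3d11.h>

    void DrawOpaqueScene(ID3D11DeviceContext* ctx);      // placeholder
    void DrawFoliageInstanced(ID3D11DeviceContext* ctx); // placeholder, its PS calls clip()

    void RenderFrame(ID3D11DeviceContext* ctx)
    {
        // Opaque geometry first: fills the depth buffer (and the coarse Hi-Z data)
        // with everything that can occlude the plants.
        DrawOpaqueScene(ctx);

        // Alpha-tested foliage last. clip()/discard limits some early-depth
        // optimizations, but tiles already fully covered by nearer opaque geometry
        // can still be rejected before the foliage pixel shader runs.
        DrawFoliageInstanced(ctx);
    }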
