
Count the number of samples that pass the depth test and stencil test


AFAIK, in DirectX 11 there are two ways to get the total number of samples that pass the depth and stencil tests:

 

1. If we use an unordered access view (UAV) with an internal counter, we can increment the counter for each processed sample and then read it back through a staging buffer.

 

2. We can use a hardware occlusion query (D3D11_QUERY_OCCLUSION).

 

But both ways slow down performance, so is there another way to get the total number of samples that pass the depth and stencil tests?


1. If we use an unordered access view (UAV) with an internal counter, we can increment the counter for each processed sample and then read it back through a staging buffer.

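That will not work unless you force early depth-stencil testing, which is incorrect to force in some situations (e.g. if discard is used). Without it, the pixel shader, and with it the counter increment, can run for samples that later fail the tests.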

But both ways slow down performance, so is there another way to get the total number of samples that pass the depth and stencil tests?

I suspect the real cause of your slowdown is that you are not waiting a few frames to retrieve the result (i.e. letting the results lag a few frames behind), which causes the CPU to stall waiting for the GPU's results.
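
A minimal sketch of that lagging scheme, written against plain callbacks so it stays independent of the D3D11 headers. `LaggedReadback`, `issue` and `poll` are hypothetical names for illustration. In a real renderer `issue` would presumably wrap Begin/End on an occlusion query, and `poll` a non-blocking `ID3D11DeviceContext::GetData` call:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical helper (not a D3D11 type): keeps N query slots in flight and
// only polls the slot that was issued N frames earlier.
template <std::size_t N>
class LaggedReadback {
public:
    // issue(slot): start this frame's query in `slot`, e.g. Begin/End on an
    //              ID3D11Query created with D3D11_QUERY_OCCLUSION.
    // poll(slot):  non-blocking read returning std::optional<std::uint64_t>,
    //              e.g. wrapping ID3D11DeviceContext::GetData called with
    //              D3D11_ASYNC_GETDATA_DONOTFLUSH (S_FALSE -> std::nullopt).
    template <class Issue, class Poll>
    std::optional<std::uint64_t> onFrame(Issue&& issue, Poll&& poll) {
        const std::size_t slot = static_cast<std::size_t>(frame_ % N);
        std::optional<std::uint64_t> result;
        if (frame_ >= N) {
            // This slot holds the query issued N frames ago, so the GPU has
            // most likely finished it and the CPU does not have to wait.
            result = poll(slot);
        }
        issue(slot);  // reuse the slot; an unread older result is dropped
        ++frame_;
        return result;
    }

private:
    std::uint64_t frame_ = 0;
};
```

With N = 3, the first three calls return nothing, and from the fourth frame on each call yields the count from three frames earlier (or nothing, if the GPU is still not done).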

Another solution would be to use a 1-channel MRT with separate blending, and use additive blending on the 2nd MRT with SRC_COLOUR and DST_ONE as operands.
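
To illustrate what the blended target ends up holding, here is a CPU stand-in for the additive blend, not actual D3D11 state setup. It assumes the pixel shader writes 1.0 to the counting target, in which case a source factor of source colour or of one gives the same product, and it assumes a float format, since adding 1.0 to a UNORM target would clamp at 1.0. `CounterTarget` and `blendAdd` are made-up names:

```cpp
#include <cstddef>
#include <vector>

// CPU stand-in for a one-channel float render target (hypothetical type).
struct CounterTarget {
    std::size_t width;
    std::size_t height;
    std::vector<float> texels;

    CounterTarget(std::size_t w, std::size_t h)
        : width(w), height(h), texels(w * h, 0.0f) {}

    // Emulates the blend for one fragment that passed depth/stencil:
    // result = src * src + dst * 1, which is dst + 1 when src is 1.0.
    void blendAdd(std::size_t x, std::size_t y, float src = 1.0f) {
        float& dst = texels[y * width + x];
        dst = src * src + dst * 1.0f;
    }
};
```

Each texel then holds the number of passing fragments that covered that pixel, so the result is a per-pixel count rather than a single total.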


I suspect the real cause of your slowdown is that you are not waiting a few frames to retrieve the result
So you suggest that the internal counter won't cause a noticeable slowdown? I always feel that counting is expensive when the number is huge. For example, in an extreme case such as rendering to a 4K RT, there may be around 8M serialized counter updates (I guess the internal counter does an InterlockedAdd). Will that cause a serious perf drop? (I haven't tested that, though.) Even for a 1080p RT, I feel that counting the updated pixels will have a very bad perf impact. Or is the internal counter update not totally serialized?

Another solution would be to use a 1-channel MRT with separate blending, and use additive blending on the 2nd MRT with SRC_COLOUR and DST_ONE as operands.
Do we need to do a reduction after this to get the actual count, or is the counter per pixel?


 

I suspect the real cause of your slowdown is that you are not waiting a few frames to retrieve the result

So you suggest that the internal counter won't cause a noticeable slowdown? I always feel that counting is expensive when the number is huge. For example, in an extreme case such as rendering to a 4K RT, there may be around 8M serialized counter updates (I guess the internal counter does an InterlockedAdd). Will that cause a serious perf drop? (I haven't tested that, though.) Even for a 1080p RT, I feel that counting the updated pixels will have a very bad perf impact. Or is the internal counter update not totally serialized?

 

Well, at 4K everything is expensive.
It's not free, but one interlocked addition with little contention shouldn't be very bad for performance (assuming you use one counter per pixel to minimize contention). If you share one counter for all pixels, it's going to run extremely slowly.
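
To put rough numbers on the contention argument, the sketch below compares how many updates can pile up on the same counter in each scheme. It is a CPU-side count, not a GPU benchmark. It assumes 4K means 3840x2160 with one sample per pixel, and `Fragment`, `fullScreenPass` and the `worstCase...` helpers are hypothetical names:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed resolutions, one sample per pixel (no MSAA).
constexpr std::uint64_t kSamples4K = 3840ull * 2160ull;
constexpr std::uint64_t kSamples1080p = 1920ull * 1080ull;
static_assert(kSamples4K == 8294400ull, "roughly the 8M updates mentioned above");
static_assert(kSamples1080p == 2073600ull, "about 2M at 1080p");

struct Fragment { std::size_t x, y; };  // one sample that passed depth/stencil

// One shared counter: every fragment updates the same location.
std::size_t worstCaseShared(const std::vector<Fragment>& frags) {
    return frags.size();
}

// One counter per pixel: only fragments covering the same pixel collide.
std::size_t worstCasePerPixel(const std::vector<Fragment>& frags,
                              std::size_t width, std::size_t height) {
    std::vector<std::size_t> hits(width * height, 0);
    for (const Fragment& f : frags) ++hits[f.y * width + f.x];
    return hits.empty() ? 0 : *std::max_element(hits.begin(), hits.end());
}

// Fragments of a full-screen pass drawn `overdraw` times.
std::vector<Fragment> fullScreenPass(std::size_t width, std::size_t height,
                                     std::size_t overdraw) {
    std::vector<Fragment> frags;
    frags.reserve(width * height * overdraw);
    for (std::size_t layer = 0; layer < overdraw; ++layer)
        for (std::size_t y = 0; y < height; ++y)
            for (std::size_t x = 0; x < width; ++x)
                frags.push_back({x, y});
    return frags;
}
```

With per-pixel counters the worst case equals the overdraw at the busiest pixel, while the shared counter takes every update in the frame.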
 

 

Another solution would be to use a 1-channel MRT with separate blending, and use additive blending on the 2nd MRT with SRC_COLOUR and DST_ONE as operands.

Do we need to do a reduction after this to get the actual count, or is the counter per pixel?

 

Yes. You can use a compute shader to perform a parallel sum reduction and then transfer 4 bytes to the CPU, or transfer the whole RTT and perform the sum on the CPU.
Which is faster depends on whether you're GPU or CPU bound, and on how much PCIe bandwidth you have available.
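
As a rough picture of the reduction step, here is a CPU emulation of a pairwise (tree) sum over the per-pixel counts. It is not the compute shader itself: each iteration of the outer loop stands for one parallel pass in which all additions are independent. The function name `treeReduce` and the float texel input are assumptions for the sketch:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sums per-pixel counts with a pairwise tree reduction.
// Counts are converted to integers first so the total stays exact.
std::uint64_t treeReduce(const std::vector<float>& texels) {
    std::vector<std::uint64_t> v(texels.size());
    for (std::size_t i = 0; i < texels.size(); ++i)
        v[i] = static_cast<std::uint64_t>(texels[i]);
    if (v.empty()) return 0;

    // Each pass folds the upper half of the active range onto the lower half.
    for (std::size_t n = v.size(); n > 1; n = (n + 1) / 2) {
        const std::size_t half = (n + 1) / 2;
        // Writes go to [0, n - half) and reads come from [half, n), so the
        // additions within one pass never touch the same element.
        for (std::size_t i = 0; i + half < n; ++i)
            v[i] += v[i + half];
    }
    return v[0];
}
```

The alternative mentioned above, copying the whole target back and summing on the CPU, would replace the passes with a single loop over the texels.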

 

Note, however, that you will still have to query the results a few frames later (i.e. introduce a delay / lag) to avoid stalls.
