Overhead of transitioning the state of resources that are not touched

8 comments, last by Mr_Fox 7 years, 3 months ago
Hi Guys,
From my understanding, transitioning the state of resource A tells the GPU to flush the related caches (is it just the cache lines for A, or the entire cache?) to RAM, so it will wait until all writes (or just the writes to A) have finished. Beyond that, it is also responsible for decompressing/compressing the resource.
So I was wondering: if my transition will not trigger compression/decompression, and the resource I am transitioning has not been touched at all since the last transition (for example, by a conditional pass which is sometimes not executed), will I still pay the overhead of transitioning the resource? My guess is that such a transition should only change some flag bits on the GPU (or on the driver side), so it shouldn't cause any GPU wait time...
It would be great if someone could explain how the transition works on the GPU.
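To be concrete, by 'transition' I just mean the usual ResourceBarrier call, roughly like this (a minimal C++ sketch; cmdList, resourceA and the particular states are placeholders from my code):

// Transition barrier sketch: declare that resourceA moves from one state to another.
D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_NONE;
barrier.Transition.pResource = resourceA;
barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
barrier.Transition.StateAfter = D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE;
cmdList->ResourceBarrier(1, &barrier);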
Thanks
P.S. I first posted this on the DirectXTech forum, but didn't get any reply for a long time, so I copied it here...

How do you know your transition won't trigger a de/compression? Future GPU models may add compression to something that previously never used it.

DX12 is an explicit API. It's not its job to figure out what you are trying to do.

If you're not going to use the resource, then do not transition it unnecessarily. Or just do it, potentially paying the price.

Btw, what are the previous state and the new one? You're omitting the most important piece of information.

Sorry for such a vague question. The question comes from one of my projects, where a DispatchIndirect compute shader defragments a huge buffer in the UAV state (so the before and after states are both UAV, but you still need a UAV->UAV transition as a barrier before and after this shader). However, this DispatchIndirect call launches zero work 90% of the time, and that information is generated on the GPU; I don't want to pay the cost of reading it back to the CPU and explicitly calling the defragment CS based on a condition. So I was concerned about the cost of these two UAV barriers when the DispatchIndirect has zero workload...

Also, I feel the same case may happen in non-UAV-state situations when you call XXXIndirect and no threads are spawned based on GPU data (for example, RT->SRV, DrawIndirect, SRV->RT, where the DrawIndirect has a zero vertex count: how much does that cost, given that RT->SRV involves decompression?). So I wish to know how the GPU handles those cases. Does the GPU know whether each resource has been touched, and use that piece of information to 'optimize' certain operations? If the GPU is that 'smart' I will be much more relieved to put those potentially zero-workload XXXIndirect calls and their related transitions in my frame routine; otherwise I will be doomed...
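For reference, the pattern I'm worried about is roughly this (a sketch; dispatchSig is a command signature containing a single dispatch argument, and cmdList, defragBuffer and argsBuffer are placeholders from my code):

// Make prior UAV writes to the buffer visible before and after the indirect dispatch.
D3D12_RESOURCE_BARRIER uavBarrier = {};
uavBarrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
uavBarrier.UAV.pResource = defragBuffer;

cmdList->ResourceBarrier(1, &uavBarrier);                            // barrier before
cmdList->ExecuteIndirect(dispatchSig, 1, argsBuffer, 0, nullptr, 0); // may launch 0 groups
cmdList->ResourceBarrier(1, &uavBarrier);                            // barrier after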

What time difference do you get if you profile once with and once without your (zero-dispatch) defragmentation?

Further, you can modify your defragmentation shader to do nothing and repeat the test with and without the transitions.

This way you should be able to get all the answers about performance, or am I wrong? I'm curious about the results :)

I did profile like you suggested :) but the result is hard to reason about (on a GTX 680M):

In all test cases the buffer defragment shader operates on a buffer of 200k * 32-bit elements. My GPU profiler reports that the defragment CS (zero workload) takes 0.01 ms including all related resource state transitions (measured by inserting timestamps at the proper places in the command list). However, in the A/B test with/without the defragment CS I get a stable 0.4 ms difference...

So the result of the A/B test certainly discourages the use of potentially zero-workload XXXIndirect calls. But I wish to know where the cost comes from: mainly from transitioning the resource, or from the XXXIndirect call itself?

But I will say that my test GPU is kind of old, and my GTX 1080 won't work with my profiler (see this post), so the results may differ on newer GPUs...

I don't suppose it's possible to do a predicated transition, like a predicated draw?
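For comparison, a predicated draw is set up roughly like this (a sketch; predBuffer and vertexCount are placeholders, with predBuffer holding a 64-bit value written on the GPU). As far as I know, SetPredication only affects draw/dispatch/copy-style operations, not ResourceBarrier, so there doesn't seem to be an equivalent "predicated transition":

cmdList->SetPredication(predBuffer, 0, D3D12_PREDICATION_OP_EQUAL_ZERO); // skip the draw if the value is 0
cmdList->DrawInstanced(vertexCount, 1, 0, 0);
cmdList->SetPredication(nullptr, 0, D3D12_PREDICATION_OP_EQUAL_ZERO);    // disable predication again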

I was just wondering how the GPU actually 'transitions' the resource: in the case where no compression/decompression is triggered, does the transition command just make sure that all shaders that touched THAT BUFFER have finished and that all writes TO THAT BUFFER have reached RAM? If that is true, then I guess the transitions around an empty XXXIndirect will have almost no cost, since there are no shaders in flight touching that buffer and there are no writes to it either. But my guess could be totally wrong...

What is an "A/B test"?

0.01 ms is what I expected. For me, dozens of zero dispatches with barriers sum up to 0.1 ms, but I don't use any transitions yet.

0.4 ms seems much too much. Maybe you can use split barriers and do some other work in between to make the transition more efficient, or doing the whole thing earlier/later may affect performance.
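Something like this (just a sketch with placeholder names and states; use whatever states your case actually needs):

// Begin the transition early, record unrelated work, then end it where the new state is needed.
D3D12_RESOURCE_BARRIER split = {};
split.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
split.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
split.Transition.pResource = buffer;
split.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
split.Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
split.Transition.StateAfter = D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE;
cmdList->ResourceBarrier(1, &split);

// ... other, unrelated work that can hide the transition latency ...

split.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
cmdList->ResourceBarrier(1, &split);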

There's Pascal support now in Nsight: https://developer.nvidia.com/nsight-visual-studio-edition-52-released-oculus-and-vulkan-support-direct3d-12-profiling-and-3?mkt_tok=eyJpIjoiT1dSaU56SXhaak0zTVRNeiIsInQiOiI5d1V3RnFDWEJUN1Bia0lPNFBvU3d2KzFtU1psakhlOGVZUkRMOWhOUUM3bnV4YWtFRjM4SWs5S3dITWNYWE0xT3VsNlkxU1pSeWM5N2lTNzdZa0dqXC9XU0E1ZXhDVG1PUXFFQ2VZOEpCWENpZlF2TEZXcld0czJsd2VIdTJnTWgifQ%3D%3D

Also, do you need a profiling tool at all? You can analyse the timestamps on your own.
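E.g. something like this, assuming a timestamp query heap and a readback buffer already exist (just a sketch with placeholder names):

// Surround the region of interest with timestamp queries and resolve them into a readback buffer.
cmdList->EndQuery(queryHeap, D3D12_QUERY_TYPE_TIMESTAMP, 0);   // before
// ... barriers + indirect dispatch ...
cmdList->EndQuery(queryHeap, D3D12_QUERY_TYPE_TIMESTAMP, 1);   // after
cmdList->ResolveQueryData(queryHeap, D3D12_QUERY_TYPE_TIMESTAMP, 0, 2, readbackBuffer, 0);

// After the fence signals that the command list has finished executing:
UINT64 freq = 0;
queue->GetTimestampFrequency(&freq);                           // ticks per second
UINT64* ticks = nullptr;
D3D12_RANGE readRange = { 0, 2 * sizeof(UINT64) };
readbackBuffer->Map(0, &readRange, reinterpret_cast<void**>(&ticks));
double ms = double(ticks[1] - ticks[0]) / double(freq) * 1000.0;
readbackBuffer->Unmap(0, nullptr);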

Some GPUs that I've worked with only have commands to flush and/or invalidate an entire cache. Other GPUs have commands to flush/invalidate particular address regions... So it depends on the GPU.
Even if you get an answer that you like for current GPUs, though, a new one might come out tomorrow that works differently.
There's also the CPU-side overhead of tracking resource states.

If you require this transition for your program to be well formed, then just put it in there. Come back to it when it shows up on the profiler :P

Thanks JoeJ, I'm kind of new to Nsight, and it seems the GTX 680M is not supported (it isn't listed on NVIDIA's support page, and the Nsight profiler is greyed out in my case).

The profiling tool I mentioned is a handcrafted one based on timestamps, but with the GTX 1080 my timestamp query heap gets corrupted, so I can't do any useful analysis with that GPU...

This topic is closed to new replies.
