[D3D12] Ping Pong Rendering

9 comments, last by SoldierOfLight 7 years, 11 months ago

I am trying to render an effect that requires ping-ponging back and forth between two textures. First I tried to implement it as a loop in a compute shader, with the two textures represented as RWStructuredBuffers and the output directed to a RWTexture2D. This resulted in a lot of "feedback" in the end texture, despite the fact that I put in calls to DeviceMemoryBarrierWithGroupSync after each iteration of the loop.

Then I tried setting up a loop in the C++ code to render via Pixel Shader back and forth between render targets, with Resource Barrier Transitions from Render Target to Pixel Shader Resource and back. This seems to only produce the output from the first ping-pong.

How should I be doing it, in theory? I liked the idea of the compute shader approach, but I'm sampling data outside of each execution's threadgroup quadrant of the texture, and that seems to be ruining things for me. I'm hesitant to Submit each render and then fence for completion, because that seems wasteful...

I guess basically, what's the best way to update a texture and ensure the update takes place before using it as a source?


The PS approach should work properly. Are you updating your SRV input descriptor as well as your RT output descriptor? Are you making sure that you don't overwrite it before the GPU has had a chance to read it?

I put together a pretty basic setup (three passes: one that writes red, one that writes green, and one that writes blue) and verified that ping-ponging back and forth with the PS method works properly. Something else in my original approach must be messing me up.

I'm still wondering why the CS method doesn't work though. If I have a RWStructuredBuffer that contains all my texture data, and my various threadgroups are sampling from that buffer, shouldn't a call to DeviceMemoryBarrierWithGroupSync sync all the thread groups after writing to the "other" buffer, such that the next iteration of the loop will have a completely filled out buffer as its source data?

I'm not an expert at writing/debugging shaders unfortunately, my expertise is more in API usage patterns. It sounds like it should work, but I'm sure there's more to it than that.

I'm still wondering why the CS method doesn't work though. If I have a RWStructuredBuffer that contains all my texture data, and my various threadgroups are sampling from that buffer, shouldn't a call to DeviceMemoryBarrierWithGroupSync sync all the thread groups after writing to the "other" buffer, such that the next iteration of the loop will have a completely filled out buffer as its source data?

I had a similar problem. Try putting a UAV resource barrier after each Dispatch. https://msdn.microsoft.com/en-us/library/windows/desktop/dn986740(v=vs.85).aspx

I'm sampling data outside of each execution's threadgroup quadrant of the texture, and that seems to be ruining things for me.

You can't do a global synchronization within a compute shader -- so if you're doing something like, say, sampling some neighbor pixels, combining them, and writing them back out, you can get a race condition. All that needs to happen is for one thread group to write out data before another thread group reads from the location it wrote to. You can't synchronize thread groups with each other, so sometimes ping ponging is a good way to do certain algorithms.
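To make the hazard concrete, here is a minimal CPU-side sketch in plain C++ (not GPU code, and not from the original poster). The function names `stepPingPong` and `stepInPlace` are made up for illustration, and the "sum of left and right neighbor" kernel is just an assumed stand-in for whatever the real shader samples. On a GPU the order in which thread groups run is not defined; the in-place version below picks one fixed order only to show that reads can observe values already overwritten in the same pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ping-pong step: reads only from `src`, writes only to a separate `dst`.
// Every output sees the untouched input, so execution order cannot matter.
std::vector<int> stepPingPong(const std::vector<int>& src) {
    assert(!src.empty());
    std::vector<int> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        std::size_t l = (i == 0) ? 0 : i - 1;               // clamp at left edge
        std::size_t r = (i + 1 == src.size()) ? i : i + 1;  // clamp at right edge
        dst[i] = src[l] + src[r];
    }
    return dst;
}

// In-place step: the same kernel, but reading and writing one buffer.
// Later cells read neighbors that earlier cells already overwrote.
std::vector<int> stepInPlace(std::vector<int> data) {
    assert(!data.empty());
    for (std::size_t i = 0; i < data.size(); ++i) {
        std::size_t l = (i == 0) ? 0 : i - 1;
        std::size_t r = (i + 1 == data.size()) ? i : i + 1;
        data[i] = data[l] + data[r];
    }
    return data;
}
```

For the input {1, 2, 3, 4}, the ping-pong step gives {3, 4, 6, 7}, while this particular in-place order gives {3, 6, 10, 14}. A different execution order would give yet another in-place result, which is the kind of "feedback" described in the first post.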

Huh. So then what is the difference between a GroupMemoryBarrier and a DeviceMemoryBarrier? The latter talks about blocking for "device memory accesses", which I took to mean things like RWStructuredBuffers, RWTexture2Ds, etc.

Yes, but only those threads within the current Thread Group, not the entire Dispatch.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

Adam Miles' answer is correct. I'll just expand on it:

Huh. So then what is the difference between a GroupMemoryBarrier and a DeviceMemoryBarrier? The latter talks about blocking for "device memory accesses", which I took to mean things like RWStructuredBuffers, RWTexture2Ds, etc.

An 8x8 ThreadGroup works on a group of 8x8 pixels. To process a 1024x1024 texture you'll need 128x128 = 16384 thread groups.
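The group count is just a rounded-up division per axis. A small sketch of that arithmetic in C++, with hypothetical helper names (`groupsForExtent`, `totalGroups`) that are not part of any API:

```cpp
#include <cassert>
#include <cstdint>

// Number of thread groups needed to cover `pixels` along one axis,
// rounding up so a partially filled group still covers the edge.
inline std::uint32_t groupsForExtent(std::uint32_t pixels, std::uint32_t groupSize) {
    assert(groupSize > 0);
    return (pixels + groupSize - 1) / groupSize;
}

// Total groups for a 2D dispatch over a width x height texture.
inline std::uint32_t totalGroups(std::uint32_t width, std::uint32_t height,
                                 std::uint32_t groupX, std::uint32_t groupY) {
    return groupsForExtent(width, groupX) * groupsForExtent(height, groupY);
}
```

With a 1024x1024 texture and 8x8 groups this gives 128 groups per axis, 16384 in total. For sizes that are not a multiple of the group size, the extra threads in the edge groups would need a bounds check in the shader.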

A DeviceMemoryBarrier will sync all transfers to global memory (such as RWStructuredBuffers, RWTexture2Ds) within the threadgroup (within that 8x8 block).

A GroupMemoryBarrier will sync all transfers to shared memory (everything declared as groupshared, which is usually stored in on-chip memory; in GCN this is called the LDS, Local Data Share), also within the threadgroup.

The difference between these two barriers is which kind of memory they sync. But neither of them can sync across the whole dispatch. There is no intrinsic function to do such a thing.

I guess basically, what's the best way to update a texture and ensure the update takes place before using it as a source?

Setting one texture as a shader resource and the other as a render target, rendering, and then swapping the two seems simpler to me than a compute approach. With a resource barrier transition on each texture between passes, the API will guarantee proper synchronization.
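The bookkeeping for that swap can be sketched on the CPU as below. This is only an illustration of the source/destination index handling, not D3D12 code: `Texture`, `runPingPong`, and the `pass` callback are hypothetical names, and the comment marks where the real barriers and descriptor binds would go.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Texture = std::vector<int>;  // stand-in for a GPU texture

// Runs `passes` iterations, each reading one texture and writing the other,
// then swapping roles. Returns the index of the texture written last.
int runPingPong(std::array<Texture, 2>& tex, int passes,
                const std::function<void(const Texture&, Texture&)>& pass) {
    assert(passes >= 0);
    int src = 0;
    int dst = 1;
    for (int i = 0; i < passes; ++i) {
        // D3D12 analogue (not shown): transition tex[dst] to render target and
        // tex[src] to pixel shader resource, then bind the matching RTV and SRV.
        pass(tex[src], tex[dst]);
        std::swap(src, dst);  // this pass's output is the next pass's input
    }
    return src;  // after the final swap, `src` names the last texture written
}
```

Both the input and the output binding change every pass, which is the point of the earlier question about updating the SRV descriptor as well as the RT descriptor. With an odd number of passes the final image ends up in the second texture, with an even number in the first.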

