DX11 - Pixel Shader 5 vs. Group Shared Memory and Atomic operations

Started by pcmaster August 15, 2011 11:21 AM

3 comments, last by pcmaster 12 years, 8 months ago

1,119

Author

August 15, 2011 11:21 AM

Greetings community,

we all know that SM5 brought the possibility to scatter writes from pixel shaders too, not only from compute shaders. MSDN is rather brief on this topic. I can only see that the Interlocked*() instructions are usable in both PS and CS, I suppose on UAVs. DeviceMemoryBarrier() seems to work in both PS and CS, and it seems to be the only barrier instruction usable in PS. My question is whether it is fundamentally impossible to take advantage of group shared memory in PS as well. I don't see an API for that, and maybe that makes sense. In GL 4.2 I noticed they released the GL_ARB_shader_image_load_store extension, which obviously supports the same stuff, but still nothing for manipulating the scarce but fast shared memory.

I have implemented various parallel algorithms in OpenCL, so although I might seem a little confused now, I'm very much aware of which memory is which and what it's good for in GPGPU via CUDA/OpenCL/DX11 CS.

Also, I see virtually nobody discussing the use of atomic instructions outside compute shaders, and I wonder why. I see some OIT and bokeh samples around which use append buffers. But I have a scenario where I need to rasterise normal geometry with a lot of textures, and where I might benefit from reducing a lot of info directly in the pixel shaders using atomic operations on global (device) buffers, instead of writing out shitloads of texture data and reducing it in parallel afterwards. I'm not going to elaborate on my scenario further; I'll just state that I need to analyse what has been rasterised. I don't know how much the performance will suffer if all units (fragments) try to write to the same memory location using InterlockedMax() or similar.
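To make the "every fragment hits one address" scenario concrete, here is a minimal CPU-side sketch in C++. It is only an analogy, not HLSL and not a GPU performance measurement: `atomic_max` stands in for InterlockedMax() on a single UAV element, and `fragment_value`, `contended_max` and `serial_max` are hypothetical helpers invented for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// CPU stand-in for InterlockedMax on one memory location:
// retry a compare-exchange until the stored value is >= candidate.
void atomic_max(std::atomic<std::uint32_t>& target, std::uint32_t candidate) {
    std::uint32_t seen = target.load(std::memory_order_relaxed);
    while (seen < candidate &&
           !target.compare_exchange_weak(seen, candidate,
                                         std::memory_order_relaxed)) {
        // A failed exchange refreshes `seen`; the loop condition re-checks it.
    }
}

// Hypothetical per-fragment value: any deterministic function of the index works.
std::uint32_t fragment_value(std::uint32_t index) {
    return (index * 2654435761u) >> 8;  // multiplicative hash, 24-bit result
}

// Worst case from the question: every simulated fragment updates the same address.
std::uint32_t contended_max(unsigned workers, std::uint32_t fragments_per_worker) {
    assert(workers > 0);
    std::atomic<std::uint32_t> result{0};
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&result, w, fragments_per_worker] {
            const std::uint32_t first = w * fragments_per_worker;
            for (std::uint32_t i = 0; i < fragments_per_worker; ++i) {
                atomic_max(result, fragment_value(first + i));
            }
        });
    }
    for (auto& t : pool) t.join();
    return result.load();
}

// Single-threaded reference used to check the contended result.
std::uint32_t serial_max(std::uint32_t count) {
    std::uint32_t best = 0;
    for (std::uint32_t i = 0; i < count; ++i) {
        if (fragment_value(i) > best) best = fragment_value(i);
    }
    return best;
}
```

The result is always correct regardless of ordering; what the sketch cannot tell you is the cost of the contention on a real GPU, which is exactly the part that needs profiling.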

Any thoughts on pixel shaders (not compute shaders!) and shared and atomic stuff in DX11?

MJP

20,295

August 15, 2011 07:18 PM

There's no way to access shared memory at all in pixel shaders. I would assume that the GPU is already using shared memory for coordinating pixel shader executions, but even if that's not the case the API has no means of using it. So you're out of luck on that one.

I really haven't played around too much with using UAVs in pixel shaders, aside from using an append buffer for bokeh (I wrote that sample you're talking about). I'd imagine it's pretty slow using device-wide interlocked operations, due to the kind of synchronization required for that sort of operation. Even interlocked adds on shared memory are pretty slow... if you look at any fast parallel reductions for compute shaders or CUDA, you'll find that they all avoid atomics. But it would definitely be better to profile than to assume, so if you do try any experiments I'd love to know how they turn out.
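For reference, the atomic-free reduction pattern mentioned above can be sketched on the CPU in C++. This is a sequential emulation under stated assumptions, not GPU code: each call to the hypothetical `reduce_pass` plays the role of one dispatch in which every group collapses `group_size` consecutive elements into one maximum, and `reduce_max` repeats passes until one value is left. No atomic operation appears anywhere, because each output element has exactly one writer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One emulated pass: each group of `group_size` consecutive inputs
// produces a single output, so no two groups ever write the same slot.
std::vector<std::uint32_t> reduce_pass(const std::vector<std::uint32_t>& in,
                                       std::size_t group_size) {
    assert(group_size >= 2);
    std::vector<std::uint32_t> out;
    out.reserve((in.size() + group_size - 1) / group_size);
    for (std::size_t begin = 0; begin < in.size(); begin += group_size) {
        const std::size_t end = std::min(begin + group_size, in.size());
        const auto first = in.begin() + static_cast<std::ptrdiff_t>(begin);
        const auto last = in.begin() + static_cast<std::ptrdiff_t>(end);
        out.push_back(*std::max_element(first, last));
    }
    return out;
}

struct ReductionResult {
    std::uint32_t value;  // final maximum
    std::size_t passes;   // number of emulated dispatches that were needed
};

// Repeat passes until a single element remains.
ReductionResult reduce_max(std::vector<std::uint32_t> data, std::size_t group_size) {
    assert(!data.empty());
    std::size_t passes = 0;
    while (data.size() > 1) {
        data = reduce_pass(data, group_size);
        ++passes;
    }
    return {data.front(), passes};
}
```

The trade-off is extra passes and intermediate buffers in exchange for zero contention; whether that beats a single contended atomic from a pixel shader is something only a profile of the real workload can answer.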

griffin77

125

August 16, 2011 01:48 AM

That is my understanding too: there is magic going on behind the scenes when you compile a pixel shader, which converts the pixel shader code into low-level GPU instructions that use shared memory and the like (basically everything you have to do yourself when you write CS or CUDA code).

I have used DeviceMemoryBarrier() in a pixel shader, but the documentation is VERY sketchy. As I understand it, this is basically a hint to tell the compiler that all the GPU threads in the current block should finish accessing global memory before continuing. Used correctly, this should reduce the memory access overhead associated with different threads accessing global memory. But without a coherent description of exactly what this means in the context of a pixel shader, it's difficult to know if I'm using it correctly. Does anyone know of a good description of what this function means in the context of a pixel shader?

Jason Z

6,437

August 16, 2011 04:27 AM

Building on what the others have said, there is no access to the group shared memory in pixel shaders. If you consider for a moment how it is used in compute shaders, I think it will be clear why. In the compute shader, you specify how large the thread groups are that you will be working with, and how many of them will be executed in a particular dispatch. Alongside the thread group size, the shader also declares how much shared memory each group will be using. This gives you very fine control over how many threads will need to access the memory, and you can design your algorithm very precisely to coordinate access to it.
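A small C++ emulation may help show why a known group size matters. It is a sketch under assumptions, not HLSL: `kGroupSize` plays the role of the compile-time thread group size, the local `shared` array plays the role of a groupshared array with one slot per thread, and the end of each inner loop stands in for a group-wide barrier. The names `run_group` and `dispatch` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the compile-time thread group size (hypothetical value).
constexpr std::size_t kGroupSize = 8;
static_assert((kGroupSize & (kGroupSize - 1)) == 0,
              "the tree reduction below needs a power-of-two group size");

// One emulated thread group computing the maximum of its slice of the input.
std::uint32_t run_group(const std::vector<std::uint32_t>& input, std::size_t group_id) {
    assert(group_id * kGroupSize < input.size());
    std::uint32_t shared[kGroupSize];  // one slot per thread, sized by the group size

    // Phase 1: each "thread" loads one element (0 when past the end of the input).
    for (std::size_t tid = 0; tid < kGroupSize; ++tid) {
        const std::size_t global_id = group_id * kGroupSize + tid;
        shared[tid] = global_id < input.size() ? input[global_id] : 0u;
    }
    // Leaving the loop models a group-wide barrier: all loads are now visible.

    // Phase 2: tree reduction. The access pattern is only safe because every
    // thread knows exactly how many threads share the array.
    for (std::size_t stride = kGroupSize / 2; stride > 0; stride /= 2) {
        for (std::size_t tid = 0; tid < stride; ++tid) {
            shared[tid] = std::max(shared[tid], shared[tid + stride]);
        }
        // Barrier again before the next, narrower step.
    }
    return shared[0];
}

// Emulated dispatch: one partial result per group.
std::vector<std::uint32_t> dispatch(const std::vector<std::uint32_t>& input) {
    const std::size_t group_count = (input.size() + kGroupSize - 1) / kGroupSize;
    std::vector<std::uint32_t> partials(group_count);
    for (std::size_t g = 0; g < group_count; ++g) {
        partials[g] = run_group(input, g);
    }
    return partials;
}
```

Every index into `shared` depends on `kGroupSize`, which is the piece of information a pixel shader never gets.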

In a pixel shader on the other hand, there is currently no method or concept of a thread group. Instead, it is up to the vendors to determine the optimal split size to be used when rasterizing a primitive, and then it is done more or less behind the scenes. This makes it impossible for a developer to write a shader that will have a coherent access strategy to the shared memory.

Who knows what will be coming in the next versions of D3D, but this seems like a logical extension of the possibilities. People have been talking about programmable rasterization for a while too, so perhaps sometime down the road there could be selectable group sizes for rasterization... That is just pure speculation though - I would be happy with a programmable rasterizer, but I don't know if one would ever come around and/or be useful...

Jason Zink :: DirectX MVP

pcmaster

1,119

Author

August 16, 2011 07:31 AM

Thanks guys, I see it's exactly as I thought. I'll try to experiment with the interlocked atomics, though.