Use Buffer or Texture, PS or CS for GPU Image Processing?


Hey Guys,

I'm currently working on a project which needs to use the GPU to run a bilateral filter over a depth frame, and it needs to be as fast as possible. Basically, I will read a depth buffer into the GPU and run a separable bilateral filter on it (filtering horizontally first and then vertically).
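For reference, the horizontal pass I have in mind looks roughly like this (a minimal pixel shader sketch; the resource names and sigma constants are placeholders, and edge clamping is omitted):

```hlsl
Texture2D<float> DepthIn : register(t0);

cbuffer FilterParams : register(b0)
{
    int   Radius;        // kernel radius in texels
    float SigmaSpatial;  // spatial falloff
    float SigmaRange;    // depth-difference falloff
};

float PSHorizontal(float4 svPos : SV_Position) : SV_Target
{
    int2  center      = int2(svPos.xy);
    float centerDepth = DepthIn.Load(int3(center, 0));

    float sum = 0.0f, weightSum = 0.0f;

    [loop]
    for (int x = -Radius; x <= Radius; ++x)
    {
        float d        = DepthIn.Load(int3(center + int2(x, 0), 0));
        float diff     = d - centerDepth;
        float wSpatial = exp(-(x * x)       / (2.0f * SigmaSpatial * SigmaSpatial));
        float wRange   = exp(-(diff * diff) / (2.0f * SigmaRange   * SigmaRange));
        sum       += d * wSpatial * wRange;
        weightSum +=     wSpatial * wRange;
    }
    return sum / weightSum;
}
```

The vertical pass is the same thing with the offset flipped to int2(0, y), which is exactly the column-by-column access I am worried about below.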

So here are some decisions I have to make:

1. Should I make the depth frame a Texture2D or Buffer?

I have read some articles which say a Texture is good for random access (Morton-pattern memory layout), while a Buffer is good for linear access. In my case, during the first pass my PS or CS will read the data roughly linearly since it's a horizontal pass, but during the vertical pass the memory access looks more like random access, since I will be reading the data column by column... I am not sure whether there are other differences between Buffer and Texture, so I need advice and an explanation on this.

2. Should I use pixel shader or compute shader?

I have seen the constant-time CS filtering algorithm, which is amazing. But I also profiled some of my image filtering algorithms in both PS and CS, and found that the PS can run much faster than the CS when the filter kernel size is small. Also, I was told the PS has some special hardware, not exposed to the CS, that makes texture-related work faster (which makes question 1 more interesting). So I think I need to know which kinds of tasks are good for the PS versus the CS.

I know I should profile it myself and use whichever is faster. But having more advice before I start is always better :-) Also, my project will be targeting future GPUs, so I think these decisions should be based on an understanding of the advantages of Texture vs. Buffer and CS vs. PS, rather than blindly trying the combinations on current GPUs.

Thanks in advance.

Peng


I haven't done a direct comparison myself, but you have already stated that it depends on the filter size. You also mentioned that the PS has access to some texture filtering instructions that aren't available to the CS - but will you make use of filtering operations? It sounds like you already know quite a bit about the differences between the two shaders, so you just need to apply that to your specific needs and see which one works best.

By the way, there is a separable bilateral filter implementation available in my Hieroglyph 3 engine in case you want to start out there. I would be interested to hear what choice you make on this topic!

You say that you want to "read a depth buffer into [the] GPU". Does that mean this depth buffer wasn't generated by the GPU itself? Where did the data come from originally?

If this is a depth buffer that was generated by the GPU earlier in the frame then I can't imagine you being able to transform the data into a new memory layout that makes your bilateral filter faster while still saving you more time than it cost you to transform the data in the first place.

You're right that each thread's access to the buffer is linear (each one accesses just a single row), but you need to consider that the warp/wavefront may not necessarily form a row of threads. In the case of a pixel shader the threads are likely to form some sort of rectangular shape (e.g. 8x4, 4x4, 8x8) or a series of smaller rectangular shapes - they won't be Nx1. For that reason, the fact that texture data is laid out in some sort of vendor-specific Morton-order-esque format is not such a problem.

It certainly might be interesting to read the data as a texture in whatever format the GPU gives you, perform the horizontal pass and then transpose the data on output from that first stage so that it's in a column major format (either a buffer or a 720x1280 diagonally mirrored image of the original).
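Very roughly, that first pass might look something like this (a sketch only; the resource names are made up, the placeholder box filter stands in for the real bilateral weights, and bounds handling is skipped):

```hlsl
Texture2D<float>   DepthIn       : register(t0); // e.g. 1280x720 input
RWTexture2D<float> TransposedOut : register(u0); // e.g. 720x1280 output

[numthreads(64, 1, 1)]
void CSHorizontalTransposed(uint3 dtid : SV_DispatchThreadID)
{
    // Placeholder horizontal filter - the real bilateral weights go here.
    float sum = 0.0f;
    [unroll]
    for (int x = -2; x <= 2; ++x)
        sum += DepthIn.Load(int3(int2(dtid.xy) + int2(x, 0), 0));

    // Swap x and y on output so the "vertical" pass can also walk along rows.
    TransposedOut[uint2(dtid.y, dtid.x)] = sum / 5.0f;
}
```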

Based on what I've seen titles do, you may still want to be able to do some bilinear reads when reading the input (you lose that if you go for a Buffer<> approach). If I only had time to make one attempt at an implementation I'd choose a Compute Shader and not touch Buffers. If the kernel is small enough that it won't use a ton of group shared memory (LDS) then I'd do it in one pass, or in two passes if the kernel is much larger. Any algorithm that typically requires lots of unfiltered reads per pixel/thread is likely to perform much better in CS, where all those unfiltered reads can be shared between lots of threads.

You mention that there are some hardware filtering options available to PS that aren't available to CS, but I can't think of any. What did you have in mind?

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

You can sample and filter textures in compute shaders just fine, and also sample buffers from pixel shaders. The filtering difference is that you won't have texcoord derivatives provided in a CS, because there aren't any, but you can still fake them. You can even do wacky stuff like not binding a render target to a pixel shader and using it for writing to a UAV only.
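For instance, something like this works fine in a compute shader - you just pass the LOD explicitly to SampleLevel instead of relying on derivatives (names here are placeholders):

```hlsl
Texture2D<float>   DepthIn     : register(t0);
RWTexture2D<float> Output      : register(u0);
SamplerState       LinearClamp : register(s0);

[numthreads(8, 8, 1)]
void CSBilinearSample(uint3 dtid : SV_DispatchThreadID)
{
    uint width, height;
    DepthIn.GetDimensions(width, height);

    // Sample at the corner shared by four texels, explicit mip level 0:
    // the sampler does the bilinear blend, no derivatives required.
    float2 uv = (float2(dtid.xy) + 1.0f) / float2(width, height);
    Output[dtid.xy] = DepthIn.SampleLevel(LinearClamp, uv, 0.0f);
}
```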

The big differences between the two, IMO, are the lack of shared memory in pixel shaders and slower writes to textures in compute shaders. A CS will be slower at processing a texture until it adequately leverages shared memory - or, in the future, dynamic dispatches or swizzles.

Thanks Jason and Adam for such quick replies. I really appreciate it.

"You mention that there are some hardware filtering options available to PS that isn't available to CS, but I can't think of any. What did you have in mind?" --Adams

When the CS was first introduced, I was very interested in the overhead of kicking one off. My guess at the time was that the CS would be faster, since it doesn't need the extra setup of launching a vertex shader, rasterization, etc. So I wrote a very naive Gaussian filter in both CS (without using groupshared memory) and PS, and the result showed the PS was a lot faster, which confused me for a long time... Later I interned at Activision working with one of their principal technical directors, and I asked him about my confusion. He told me there is a decent number of transistors in the GPU that aren't available to the CS, and he named some tasks that use this graphics-specific hardware, including texture filtering and faster writes to the RT (compressed formats) (sorry, I didn't ask for more details and forget the other tasks he mentioned...). Dingleberry may know more detail, judging from the reply above. It would be great if someone could talk about that specific hardware and explain what it does.

"You say that you want "read a depth buffer into [the] GPU". Does that mean this depth buffer wasn't generated by the GPU itself? Where did the data come from originally?" --Adams

The depth map is generated by a depth sensor (Kinect2), so it comes from the CPU. And this brings me to another question:

Should I create the texture/buffer directly in the upload heap, or first copy my depth map to the upload heap and then copy it to a texture/buffer in the default heap? The MSDN docs say the upload heap is not as fast as the default heap (I am super curious what makes upload heap access slower than default heap access - are there different VRAM zones for them, or do cache settings cause the perf difference?). In my project, the depth map will be generated roughly every 16 ms, and after it is copied to the GPU, only my bilateral filter pass will touch this buffer/texture directly (that pass will output its result to a default heap resource for later processing, for sure), so I guess I have to profile this to see whether the overhead of the extra copy from the upload heap to the default heap is worth it...

"Based on what I've seen titles do you may still want to be able to do some bilinear reads when reading the input (you lose that if you go for a Buffer<> approach). " --Adams

Yes, I have seen someone use one linear sample to extract enough information for a pixel location along with its neighbor pixels within that quad. I guess that is the kind of small speed trick I'm hoping to use.
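If I understand the trick correctly, that sounds like the Gather (gather4) family of intrinsics - one fetch returns the four texels that bilinear filtering would have blended. Something like this (resource and sampler names are placeholders; check the docs for the exact component ordering):

```hlsl
Texture2D<float> DepthIn    : register(t0);
SamplerState     PointClamp : register(s0);

// Returns the 2x2 quad of depth values around uv in one fetch,
// instead of four separate Load/Sample calls.
float4 FetchDepthQuad(float2 uv)
{
    return DepthIn.GatherRed(PointClamp, uv);
}
```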

Thanks guys for sharing your knowledge, I will let you know what I have found.

Peng

Check out http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF

On page 10:

"To reduce DRAM bandwidth demands, NVIDIA GPUs make use of lossless compression techniques as data is written out to memory."

AMD does something similar (I think it's basically the same thing). Compute shaders don't use ROPs, so they aren't going to get any benefit from them. A compute shader is pretty generalized, so the data it outputs isn't necessarily going to be correlated, or even coherent.

To make the CS a win, you could load a tile into LDS and do the horizontal / vertical filtering on a copy of this data, also in LDS. You would read every pixel only once, the access is linear, and it should be very fast. I assume it's still faster even with the additional complexity of handling the tile borders.
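A rough skeleton of that idea, assuming a 16x16 thread group and a radius-4 kernel (the bilateral weights are replaced with a plain average and border handling is simplified to a clamp inside the tile):

```hlsl
#define TILE   16
#define RADIUS 4
#define CACHE  (TILE + 2 * RADIUS)   // 24x24 cached tile

Texture2D<float>   DepthIn  : register(t0);
RWTexture2D<float> Filtered : register(u0);

groupshared float TileA[CACHE][CACHE]; // raw depths, read from memory once
groupshared float TileB[CACHE][CACHE]; // result of the horizontal pass

[numthreads(TILE, TILE, 1)]
void CSTileFilter(uint3 gtid : SV_GroupThreadID,
                  uint3 gid  : SV_GroupID,
                  uint3 dtid : SV_DispatchThreadID)
{
    int2 tileOrigin = int2(gid.xy) * TILE - RADIUS;
    uint flat       = gtid.y * TILE + gtid.x;

    // 1. Cooperatively load the padded tile: every texel is read exactly once.
    for (uint i = flat; i < CACHE * CACHE; i += TILE * TILE)
    {
        int2 coord = tileOrigin + int2(i % CACHE, i / CACHE);
        TileA[i / CACHE][i % CACHE] = DepthIn.Load(int3(coord, 0));
    }
    GroupMemoryBarrierWithGroupSync();

    // 2. Horizontal pass entirely in LDS: TileA -> TileB.
    for (uint j = flat; j < CACHE * CACHE; j += TILE * TILE)
    {
        uint row = j / CACHE, col = j % CACHE;
        float h = 0.0f;
        for (int x = -RADIUS; x <= RADIUS; ++x)
            h += TileA[row][clamp((int)col + x, 0, CACHE - 1)];
        TileB[row][col] = h / (2 * RADIUS + 1);
    }
    GroupMemoryBarrierWithGroupSync();

    // 3. Vertical pass in LDS, one output pixel per thread, written once.
    uint2 c = gtid.xy + RADIUS;
    float v = 0.0f;
    for (int y = -RADIUS; y <= RADIUS; ++y)
        v += TileB[c.y + y][c.x];
    Filtered[dtid.xy] = v / (2 * RADIUS + 1);
}
```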

Does that offer a competitive advantage over texture sampling through the texture cache, though?

I have researched into this very recently, so it's fresh in my memory:

For large kernels (kernel_radius > 4, which means > 9 taps per pixel), Compute Shaders outperform Pixel Shaders; once the kernel becomes large enough, the difference is up to 100% on my AMD Radeon HD 7770.

However, you need to be careful about the CS method because maximizing throughput isn't easy.

"Efficient Compute Shader Programming" from Bill Bilodeau describes several ways on maximizing throughput, and GPUOpen has a free SeparableFilter11 implementation of the techniques described there with full source code and a demo with lots of knots to tweak and play with.

As for Buffer vs Texture, like you said, a linear layout is great for the horizontal pass but terrible for the vertical pass, so Textures perform better. Also, if you end up sampling this texture later on in non-linear patterns (or need something other than point filtering), a Texture is usually a win.

You may want to look into addressing the texture from the Compute Shader in Morton order to undo the Morton pattern of the texture and hence improve speed when possible, but I haven't looked into that.
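In case it helps, the generic way to walk a dispatch in a Z-pattern is to de-interleave a flat index into x/y with the usual bit tricks. Note this is just the textbook Morton decode, not the vendor's actual (undocumented) internal swizzle:

```hlsl
// De-interleave the even bits of 'index' into x and the odd bits into y,
// giving Morton (Z-order) coordinates with up to 16 bits per axis.
// Typical use: map a flat group/thread index to 2D coords before Load.
uint2 MortonDecode(uint index)
{
    uint2 v = uint2(index, index >> 1) & 0x55555555;
    v = (v | (v >> 1)) & 0x33333333;
    v = (v | (v >> 2)) & 0x0F0F0F0F;
    v = (v | (v >> 4)) & 0x00FF00FF;
    v = (v | (v >> 8)) & 0x0000FFFF;
    return v;
}
```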

And of course, on D3D12/Vulkan, a Compute Shader based solution means an opportunity for Async Shaders which can increase speed on AMD's GCN, or decrease it on NVIDIA.

How do you undo the Morton pattern? Would you create the texture with something like D3D12_TEXTURE_LAYOUT_STANDARD_SWIZZLE? That seems not to be the intended usage of the flag, but it also implies that an undefined swizzle will be, well, undefined.

If the hardware is automatically translating texture indices for you, could you maybe alias the texture memory as a buffer, write a known pattern, and then undo the pattern?

