Mr_Fox

Use Buffer or Texture, PS or CS for GPU Image Processing?


Hey Guys,

 

I'm currently working on a project which needs to use the GPU to run a bilateral filter over a depth frame, and it needs to be as fast as possible. So basically I will read a depth buffer into the GPU and run a separable bilateral filter on it (filtering first horizontally and then vertically).
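To make the scheme concrete, here is a rough CPU-side sketch of what I mean: a 1-D bilateral pass applied horizontally and then vertically. This is plain Python just to pin down the math; the sigma values are placeholders, and the separable form is only an approximation of the true 2-D bilateral filter.

```python
import math

def bilateral_1d(row, radius=2, sigma_s=1.5, sigma_r=0.1):
    """One 1-D bilateral pass over a list of depth values.

    Each weight combines a spatial Gaussian with a range Gaussian on the
    depth difference, so depth discontinuities are preserved.
    """
    out = []
    for i, center in enumerate(row):
        wsum = vsum = 0.0
        for d in range(-radius, radius + 1):
            j = min(max(i + d, 0), len(row) - 1)  # clamp at the borders
            w = (math.exp(-(d * d) / (2.0 * sigma_s ** 2)) *
                 math.exp(-((row[j] - center) ** 2) / (2.0 * sigma_r ** 2)))
            wsum += w
            vsum += w * row[j]
        out.append(vsum / wsum)
    return out

def separable_bilateral(img, **kw):
    """Horizontal pass, then vertical pass over the transposed result."""
    h = [bilateral_1d(r, **kw) for r in img]            # horizontal
    v = [bilateral_1d(list(c), **kw) for c in zip(*h)]  # vertical (columns)
    return [list(r) for r in zip(*v)]                   # transpose back
```

On the GPU each pass would of course be one dispatch/draw, but the weight math per texel is the same.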

 

So here are some decisions I have to make:

1. Should I make the depth frame a Texture2D or Buffer?

I have read some articles saying that a Texture is good for random access (Morton-pattern memory layout), while a Buffer is good for linear access. But in my case, during the first pass my PS or CS will read the data more or less linearly, since it's a horizontal pass; during the vertical pass, though, the memory access looks more like random access, since I will be reading the data column by column... I am not sure whether there are other differences between Buffer and Texture, so I'd appreciate advice and explanation on this.
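To illustrate what the Morton pattern buys you (real vendor swizzles are proprietary and more elaborate, so this is only a toy model): Z-order interleaves the x and y bits of the texel address, which keeps vertical neighbours much closer in address space than a row-major layout does, at the cost of spreading horizontal neighbours out.

```python
def morton2(x, y, bits=16):
    """Interleave the bits of x and y (x in the even positions here;
    actual GPU swizzle patterns are vendor-specific)."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)
        code |= ((y >> b) & 1) << (2 * b + 1)
    return code

row_major = lambda x, y, pitch=64: y * pitch + x  # linear layout for comparison

def mean_step(addr, w, h, dx, dy):
    """Average address distance between a texel and its (dx, dy) neighbour."""
    steps = [abs(addr(x + dx, y + dy) - addr(x, y))
             for y in range(h - dy) for x in range(w - dx)]
    return sum(steps) / len(steps)
```

Over a 64x64 tile, the row-major vertical step is always a full row pitch, while the Z-order vertical step is smaller on average, which is roughly why textures tolerate column-wise access better than a linear buffer does.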

 

2. Should I use pixel shader or compute shader?

I have seen the constant-time CS filtering algorithm, which is amazing. But I have also profiled some of my image-filtering algorithms in both PS and CS, and found that the PS can run much faster than the CS when the filter kernel is small. I was also told that the PS has some special hardware, not exposed to the CS, that does texture-related work faster (which makes question 1 more interesting). So I'd like to know which kinds of task are a good fit for PS versus CS.

 

I know I should profile it myself and use the faster one based on the result. But having more advice before I start is always better :-)   Also, my project targets future GPUs, so I think these decisions should be based on an understanding of the advantages of Texture vs. Buffer and CS vs. PS, rather than on blindly trying every combination on a current GPU.

 

Thanks in advance.

 

Peng


I haven't done a direct comparison myself, but you have already stated that it depends on the filter size. You also mentioned that the PS has access to some texture filtering instructions that aren't available to the CS - but will you actually make use of filtering operations? It sounds like you already know quite a bit about the differences between the two shader types, so you just need to apply that to your specific needs and see which one fits.

 

By the way, there is a separable bilateral filter implementation available in my Hieroglyph 3 engine in case you want to start out there.  I would be interested to hear what choice you make on this topic!


You say that you want "read a depth buffer into [the] GPU". Does that mean this depth buffer wasn't generated by the GPU itself? Where did the data come from originally?

 

If this is a depth buffer that was generated by the GPU earlier in the frame then I can't imagine you being able to transform the data into a new memory layout that makes your bilateral filter faster while still saving you more time than it cost you to transform the data in the first place.

 

You're right that each thread's access to the buffer is linear (access just a single row) but you need to consider that the warp/wavefront may not necessarily form a row of threads. In the case of a pixel shader the threads are likely to form some sort of rectangular shape (eg 8x4, 4x4, 8x8) or series of smaller rectangular shapes - they won't be Nx1. For that reason the fact that texture data is laid out in some sort of vendor-specific Morton-order-esque format is not such a problem.

 

It certainly might be interesting to read the data as a texture in whatever format the GPU gives you, perform the horizontal pass and then transpose the data on output from that first stage so that it's in a column major format (either a buffer or a 720x1280 diagonally mirrored image of the original).

 

Based on what I've seen titles do you may still want to be able to do some bilinear reads when reading the input (you lose that if you go for a Buffer<> approach). If I only had time to make one attempt at an implementation I'd choose a Compute Shader and not touch Buffers. If the kernel is small enough that it won't use a ton of group shared memory (LDS) then I'd do it in one pass, or in two passes if the kernel is much larger. Any algorithm that typically requires lots of unfiltered reads per pixel/thread is likely to perform much better in CS where all these unfiltered reads can be shared between lots of threads.

 

You mention that there are some hardware filtering options available to the PS that aren't available to the CS, but I can't think of any. What did you have in mind?

You can sample and filter textures in compute shaders just fine, and also sample buffers from pixel shaders. The filtering difference is that you won't have texcoord derivatives provided in a CS, because there aren't any, but you can still fake them. You can even do wacky stuff like not binding a render target to a pixel shader and using it for writing to a UAV only.

The big differences between the two, IMO, are the lack of shared memory in pixel shaders and the slower writes to textures from compute shaders. A CS will be slower at processing a texture until it adequately leverages shared memory. Or, in the future, until it leverages dynamic dispatches or swizzles.


Thanks Jason and Adam for such quick replies. I really appreciate it.

 

 

 

"You mention that there are some hardware filtering options available to PS that isn't available to CS, but I can't think of any. What did you have in mind?"   --Adam

 

When the CS was first introduced, I was very interested in the overhead of kicking off a CS. My guess at the time was that the CS would be faster, since it doesn't need the extra setup of launching the vertex shader, rasterization, etc. So I wrote a very naive Gaussian filter in both CS (without using groupshared memory) and PS, and the result showed the PS was a lot faster, which confused me for a long time... Later I interned at Activision, working with one of their principal technical directors, and I asked him about my confusion. He told me there is a decent number of transistors in the GPU that are not available to the CS, and he named some tasks that use this graphics-specific hardware, including texture filtering and faster writes to the RT (compressed formats) (sorry, I didn't ask for more details and have forgotten the other tasks he mentioned...). Dingleberry may know more details, judging by the reply. It would be great if someone could talk about this specific hardware and explain what it does.

 

 

 

"You say that you want "read a depth buffer into [the] GPU". Does that mean this depth buffer wasn't generated by the GPU itself? Where did the data come from originally?" --Adam

 

The depth map is generated by a depth sensor (Kinect v2), so it comes from the CPU. And this raises another question:

 

Should I create the texture/buffer directly in the upload heap, or first copy my depth map to the upload heap and then copy it to a texture/buffer in the default heap? The MSDN docs say the upload heap is not as fast as the default heap (I am very curious what makes upload-heap access slower than default-heap access: are there different VRAM zones for them, or do cache settings cause the perf difference?). In my project, a depth map will be generated roughly every 16 ms, and after the copy to the GPU only my bilateral filter pass will touch this buffer/texture directly (that pass will output its result to the default heap for later processing, for sure), so I guess I have to profile this to see whether the overhead of the extra copy from the upload heap to the default heap is worth it...

 

 

"Based on what I've seen titles do you may still want to be able to do some bilinear reads when reading the input (you lose that if you go for a Buffer<> approach)." --Adam

Yes, I have seen someone use a single bilinear sample to extract all the information he needed for a pixel location along with its neighbouring pixels within the quad. I guess that's a nice speed trick I'd like to keep available.

 

Thanks guys for sharing your knowledge, I will let you know what I find.

 

Peng


Check out http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF

 

On page 10: 

 

 

To reduce DRAM bandwidth demands, NVIDIA GPUs make use of lossless compression techniques as data is written out to memory

 

AMD does something similar (I think it's basically the same thing). Compute shaders don't use ROPs, so they aren't going to get any benefit from them. A compute shader is pretty generalized, so the data it outputs isn't necessarily going to be correlated, or even coherent.

Edited by Dingleberry

To make the CS a win, you could load a tile into LDS and do the horizontal/vertical filtering on a copy of this data, also in LDS.
You would read every pixel only once, and the access is linear, so it should be very fast.
I assume it's still faster even with the additional complexity of handling the tile borders.
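As a rough budget check for this scheme, here is a little helper. The 32 KB of group-shared memory, 4 bytes per texel, and one scratch copy of the tile are all assumptions (adjust to your hardware and format); the point is just that the tile plus its filter-radius apron, times the number of LDS copies, must fit the budget, and staying well under it also helps occupancy.

```python
def lds_bytes(tile_w, tile_h, radius, bytes_per_texel=4, copies=2):
    """Group-shared memory needed for a tile plus its `radius`-wide
    apron, times the number of LDS copies (source tile + scratch copy
    for the horizontal/vertical passes in the scheme above)."""
    side_w = tile_w + 2 * radius
    side_h = tile_h + 2 * radius
    return side_w * side_h * bytes_per_texel * copies

def fits_in_lds(tile_w, tile_h, radius, budget=32 * 1024):
    """D3D compute shaders can declare up to 32 KB of groupshared memory."""
    return lds_bytes(tile_w, tile_h, radius) <= budget
```

For example, a 32x32 tile with radius 8 fits comfortably, while a 64x64 tile with radius 16 blows the budget and would need to be split into passes.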


How do you undo the Morton pattern? Would you create the texture with something like D3D12_TEXTURE_LAYOUT_STANDARD_SWIZZLE? That seems not to be the intended usage of the flag, but it also implies that an undefined swizzle will be, well, undefined.

 

If the hardware is automatically translating texture indices for you, could you maybe alias the texture memory as a buffer, write a known pattern, and then undo the pattern?


How do you undo the Morton pattern? Would you create the texture with something like D3D12_TEXTURE_LAYOUT_STANDARD_SWIZZLE? That seems not to be the intended usage of the flag, but it also implies that an undefined swizzle will be, well, undefined.

For the unknown swizzles, you can only guess at standard patterns and test whether they're faster (since it's GPU-specific). Obviously on console it's way easier, since the HW is fixed.
Check out this tweet history.
 

If the hardware is automatically translating texture indices for you, could you maybe alias the texture memory as a buffer, write a known pattern, and then undo the pattern?

I suppose you could but I'm not sure if it's legal to do that. Might not work on future hardware? Honestly I don't know.


I have a Maxwell GPU at home, so I only have tier 1 resource heaps and can't alias them like that :(. But I think it should work if you created a buffer and a texture out of the same heap memory?


I did some testing: if you create a buffer with sequential data, copy it to a texture that was created with CreatePlacedResource, and then call ReadFromSubresource on the texture, it returns a really funny pattern:

2, 64, 258, 320, 6, 68, 262, 324 ...

Kind of weird, because doing this on a texture created with CreateCommittedResource yields sequential data. I don't think the access pattern it's giving back is useful to me. I'm not really sure what I'm looking at.


I did some testing: if you create a buffer with sequential data, copy it to a texture that was created with CreatePlacedResource, and then call ReadFromSubresource on the texture, it returns a really funny pattern:

2, 64, 258, 320, 6, 68, 262, 324 ...

Kind of weird, because doing this on a texture created with CreateCommittedResource yields sequential data. I don't think the access pattern it's giving back is useful to me. I'm not really sure what I'm looking at.

You're probably looking at a vendor-specific D3D12_TEXTURE_LAYOUT_UNKNOWN ordering.

What texture layout and resource state arguments did you give to those functions?


It was an unknown layout and copy-dest, since it was in a readback heap. I tried putting it into the default heap too, but that didn't change anything. So it's just upload buffer -> readback texture, and the results vary based on whether it's made with CreateCommittedResource or CreatePlacedResource.


Interesting :) Well, now you're probing the undefined internal behaviors of your vendor's specific d3d driver logic :lol:

Converting from the initial data format into the optimized "unknown layout" has a cost associated with it -- drivers must guess whether they'll pay this cost at all (in order to make later memory accesses faster) and, if so, at what point they'll pay it. You've discovered that your driver decides to perform this transformation sooner if the user is performing their own memory management, and later if the user asks the driver to perform the memory management... I guess the driver is betting that a placed resource will be longer-lived than a driver-managed resource?

 

BTW, yes, those numbers you posted do seem like some kind of Morton order / z-order curve, possibly unique to your GPU.

Edited by Hodgman


Hey Guys,

 

I've got an interesting finding:

For linear buffer reads, the Compute Shader is around 15% faster than the Pixel Shader on my GTX 680M.

 

To be more specific: my DX12 program creates a permanently mapped buffer in the upload heap and copies each image buffer from the camera into it. Then I use a CS or PS to copy (render) this image into a texture (a swizzled buffer in the default heap). So the CS and PS are simple and almost identical: just read and write/output, with no TGSM (thread group shared memory) used in the CS.

 

My expectation was that, since I didn't use TGSM in the CS, the PS should run a little faster; I observed that PS speedup a long time ago when copying one texture to another (but those textures were all swizzled in the default heap). However, the result suggests that for reading an unswizzled buffer in the upload heap, the CS may be faster...

 

I can't find any reasonable explanation for this. So it would be greatly appreciated if someone could confirm it (it's possible I messed something up in the PS and got a worse result than the CS should have) and explain why the CS is faster.

 

Thanks in advance

 

Peng 


Why aren't you just using a copy function to copy the texture data? Shouldn't that ideally be the fastest way since it doesn't involve pipeline state?

 

Also, yeah, pixel shaders probably won't be as fast at reading linear data because they render in sets of 2x2 quads, not like 32x1 lines. But your typical texture isn't going to be stored linearly -- even in this case you're really copying from a buffer and storing to a texture, not reading and writing to/from a texture, which is the more common gpu image operation.

 

https://developer.nvidia.com/sites/default/files/akamai/gameworks/images/lifeofatriangle/fermipipeline_distribution.png

 

Basically, just note that stuff gets shaded in blocks -- if it were a linear distribution of shading, you'd see it shaded as horizontal line strips.
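Here's a toy model of that point (the 8x4 warp footprint and 128-byte cache lines are illustrative assumptions, not any particular GPU): count the distinct cache lines a 32-thread warp touches when each thread reads one texel from a row-major buffer.

```python
def cache_lines_touched(coords, pitch_texels, bytes_per_texel=4, line_bytes=128):
    """Distinct cache lines hit when each thread in a warp reads one
    texel from a row-major buffer with the given row pitch (in texels)."""
    return len({(y * pitch_texels + x) * bytes_per_texel // line_bytes
                for x, y in coords})

# A CS can launch its 32 threads as one horizontal line...
linear_warp = [(x, 0) for x in range(32)]
# ...while a PS warp covers something like an 8x4 block of 2x2 quads.
quad_block_warp = [(x, y) for y in range(4) for x in range(8)]
```

With a 1024-texel row pitch, the linear warp's 32 reads land in a single 128-byte line, while the 8x4 block touches one line per row -- which is roughly why a PS is at a disadvantage when reading linear data.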

Edited by Dingleberry


Why aren't you just using a copy function to copy the texture data? Shouldn't that ideally be the fastest way since it doesn't involve pipeline state?

 

Also, yeah, pixel shaders probably won't be as fast at reading linear data because they render in sets of 2x2 quads, not like 32x1 lines. But your typical texture isn't going to be stored linearly -- even in this case you're really copying from a buffer and storing to a texture, not reading and writing to/from a texture, which is the more common gpu image operation.

 

https://developer.nvidia.com/sites/default/files/akamai/gameworks/images/lifeofatriangle/fermipipeline_distribution.png

 

Aha, I should be honest: I'm not just reading and outputting. I convert the data before outputting, to satisfy my data-processing requirements. That's why I never thought of using the copy function.

 

And thanks for your reply, that really helps. A follow-up question: since the output is a texture, the write is not linear either. Will the PS benefit there, given that the PS should be optimized for swizzled and compressed output to the render target?

 

Thanks again

Edited by Mr_Fox


Will the driver decide whether to use a swizzled or linear read based on your buffer desc and on whether you are using a sampler? In my case the input buffer is linear in the upload heap, and my PS/CS doesn't use a sampler. My expectation is that the driver will treat this buffer like a 'constant buffer' instead of a 'texture buffer'. But I may be totally wrong.

 

Also, does that mean that if I use a UAV write to a swizzled buffer, the GPU will generate code to swizzle the index? Or is there specific hardware, in something like a DMA unit, that does it automatically?


Pretty sure the resource will be swizzled or non-swizzled depending on the layout you tell D3D it has, so reads/writes/copies should always respect the memory layout. I don't know exactly when the swizzling happens, but naturally copying (i.e. ID3D12GraphicsCommandList::Copy) will do the right thing, and writing to an (x, y) position in a texture UAV will also do the right thing.

 

If you're uploading texture data from some outside, linear layout source, then at some point you need to swizzle it. It's probably not going to make a huge difference if you do it with a copy or a shader -- in both cases you're doing a (presumably) single read from upload memory. From that point it might be best to leave it as a texture for further image processing.

 

If you suspect that doing linear memory processing on it using a compute shader will be faster, then try leaving it as a buffer resource for all your compute processing and you have the option of doing perfectly coalesced reads. You'll still be doing exactly one upload->default memory transfer and one linear->swizzled operation.

Edited by Dingleberry

