[DX11] Tile-based Deferred Shading in BF3 discussion


Currently I store the worldLightPos and viewLightPos matrices in one RWStructuredBuffer, and I transform them from world to view space with a compute shader, writing back into the same RWStructuredBuffer. But I haven't measured the performance.
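For reference, a minimal sketch of what that compute shader could look like, with one thread per light writing the view-space position back into the same buffer (the struct layout, names, and row-vector mul convention are assumptions, not the poster's actual code):

[code]
// Sketch only: struct layout and names are placeholders.
struct Light
{
    float3 positionWS;   // written by the CPU in world space
    float  radius;
    float3 positionVS;   // filled in below
    float  pad;
};

cbuffer TransformCB : register(b0)
{
    float4x4 ViewMatrix;
    uint     LightCount;
};

RWStructuredBuffer<Light> Lights : register(u0);

[numthreads(64, 1, 1)]
void TransformLightsCS(uint3 dtid : SV_DispatchThreadID)
{
    uint index = dtid.x;
    if (index >= LightCount)
        return;

    Light light = Lights[index];
    // Row-vector convention: position * matrix.
    light.positionVS = mul(float4(light.positionWS, 1.0f), ViewMatrix).xyz;
    Lights[index] = light;
}
[/code]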

I don't think you will have conflicts with Map/Unmap, but maybe a staging buffer is a good way to go. Then you have more control over where the memory is allocated.


Please keep posting your results, it is an interesting read!


That's a really interesting idea. So you're saying that your light buffer has space for both the world and view positions, but the view position is a placeholder until the shader writes the transformed data to it?

[quote name='360GAMZ' timestamp='1324090742' post='4894682']
I'm forging ahead on my implementation of tile based CS lighting. One thing I ran into is that since the mini-frustum vs. light culling that the threads do is in view space, my light data (position and direction) needs to be in view space, too. In my game, all lights are stored in world space, so I could simply transform them to view space on the CPU as they're being written to the StructuredBuffer. I'm not too excited about doing this since our games tend to be CPU limited.
[/quote]

Why not convert the mini-frustums to world space instead? This would effectively require you to get the world space position and orientation of the camera, then you can generate your mini-frustums from that. That way your lights stay in world space, your mini-frustums are in world space, and no transformation is required on the CPU or GPU.

Would that work in your use case?

I think that should definitely work. Though, it would require 6 transformations instead of the 2 I'm currently doing: light to view space for culling, and pixel position to world space for the lighting calc. Alternatively, I could do the lighting calc in view space, but I would have to transform the light to view space a second time, so it's a wash. Unless I stored the transformed light for reuse in the lighting calc, but I believe three dot products are faster than a resource store + load.

[quote name='360GAMZ']
I think that should definitely work. Though, it would require 6 transformations instead of the 2 I'm currently doing: light to view space for culling, and pixel position to world space for the lighting calc. Alternatively, I could do the lighting calc in view space, but I would have to transform the light to view space a second time, so it's a wash. Unless I stored the transformed light for reuse in the lighting calc, but I believe three dot products are faster than a resource store + load.
[/quote]
Maybe I am not really understanding (sorry for beating a dead horse...), but if all of the following are true in your setup:

  1. Light data is in world space
  2. Frustum data is in view space
  3. Pixel position (in view space?)
  4. Lighting is carried out in view space

If all of that is true, then you should be able to convert the frustums to world space, reconstruct the world space pixel position instead of the view position, and then carry out the lighting in world space. That would reduce the overall work needed on the GPU while minimizing the work needed on the CPU (the frustum data must be built on the CPU). Am I seeing this correctly?
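For what it's worth, the world-space reconstruction this implies is only a few instructions if you keep an inverse view-projection matrix around. A rough sketch (the matrix name, UV convention, and row-vector mul are assumptions, not anyone's actual code):

[code]
// Sketch only: assumes D3D-style depth in [0, 1] and a row-vector mul convention.
float3 ReconstructWorldPosition(float2 uv, float hardwareDepth, float4x4 invViewProjection)
{
    // UV (0,0 = top-left) to NDC (-1..1, y up).
    float2 ndcXY = uv * float2(2.0f, -2.0f) + float2(-1.0f, 1.0f);
    float4 clipPos = float4(ndcXY, hardwareDepth, 1.0f);
    float4 worldPos = mul(clipPos, invViewProjection);
    return worldPos.xyz / worldPos.w;
}
[/code]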


Very interesting topic!

I am working on a deferred pipeline for PC. Since the tile-based technique has been implemented on the X360, can anyone tell me the advantages and disadvantages of tile-based over quad-based deferred shading in DirectX 10?

Thanks so much!
Ardilla, not sure about consoles, but on PC tile-based has been superior in my experience. Andrew Lauritzen has a paper and a full demo with source code that lets you play around with various methods, including tile-based vs. quad-based:

http://visual-computing.intel-research.net/art/publications/deferred_rendering/
Well you still get the main benefit, which is that you can batch multiple lights while shading each pixel which saves you bandwidth (both from sampling the G-Buffer, and blending the lighting result). What you lose out on by using a pixel shader is shared memory, which prevents you from doing the per-tile culling directly in the shader in the manner used by Frostbite 2 and Andrew Lauritzen's demo. So you either have to find some other way to do the tile->light association on the GPU, or you have to do it on the CPU.
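To make the shared-memory point concrete, here is a heavily condensed compute-shader sketch of that per-tile culling idea, in the spirit of the Lauritzen demo rather than code taken from it. The tile frustum planes are assumed to be precomputed elsewhere, and all names and layouts are placeholders:

[code]
// Sketch only: one 16x16 thread group per tile; frustum construction,
// depth min/max and the shading loop are omitted.
#define TILE_SIZE            16
#define MAX_LIGHTS_PER_TILE  256

struct PointLight
{
    float3 positionVS;  // view-space position
    float  radius;
};

struct TileFrustum
{
    float4 planes[4];   // the four side planes of this tile's mini-frustum
};

StructuredBuffer<PointLight>  Lights       : register(t0);
StructuredBuffer<TileFrustum> TileFrustums : register(t1);

cbuffer CullingCB : register(b0)
{
    uint LightCount;
    uint TileCountX;
};

groupshared uint gs_TileLightCount;
groupshared uint gs_TileLightIndices[MAX_LIGHTS_PER_TILE];

[numthreads(TILE_SIZE, TILE_SIZE, 1)]
void TiledCullingCS(uint3 groupID : SV_GroupID, uint groupIndex : SV_GroupIndex)
{
    if (groupIndex == 0)
        gs_TileLightCount = 0;
    GroupMemoryBarrierWithGroupSync();

    TileFrustum frustum = TileFrustums[groupID.y * TileCountX + groupID.x];

    // Each of the 256 threads in the group tests a strided slice of the lights.
    for (uint i = groupIndex; i < LightCount; i += TILE_SIZE * TILE_SIZE)
    {
        PointLight light = Lights[i];
        bool inside = true;
        [unroll]
        for (uint p = 0; p < 4; ++p)
        {
            if (dot(float4(light.positionVS, 1.0f), frustum.planes[p]) < -light.radius)
                inside = false;
        }

        if (inside)
        {
            uint slot;
            InterlockedAdd(gs_TileLightCount, 1, slot);
            if (slot < MAX_LIGHTS_PER_TILE)
                gs_TileLightIndices[slot] = i;
        }
    }
    GroupMemoryBarrierWithGroupSync();

    // ...sample the G-Buffer for this thread's pixel and accumulate lighting
    // from gs_TileLightIndices[0 .. gs_TileLightCount - 1] here...
}
[/code]

The groupshared light list is exactly the part a pixel shader can't build, which is why a DX10 path has to do the tile-to-light association some other way.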
Mmm, interesting. I'm going to implement a light-volume technique first (I understand it better), and then I'll try the tile-based approach to see the performance difference.

Thanks for the answers!
I've run into a problem trying to render translucent objects into the scene after the deferred rendering has finished with the opaque objects.

Since a picture is worth a thousand words, here's my current DX11 rendering pipeline:

[sharedmedia=gallery:images:1545]

Since the translucent objects need to sort against the opaque scene, I want to reuse the depth buffer created during the deferred pass. However, the depth buffer is MSAA while the final render target is non-MSAA and so they can't be used together.

Here's one possible solution:

[sharedmedia=gallery:images:1544]

Here, the Lauritzen resolve shader is replaced with a shader that converts the flat StructuredBuffer into an MSAA render target (compute shaders cannot write to MSAA buffers, which is why Lauritzen uses a flat StructuredBuffer that holds all MSAA samples of the image). Since the lit render target is now MSAA, it can be used in conjunction with the MSAA depth buffer to render translucent objects. Finally, the ID3D11DeviceContext::ResolveSubresource() method is used to resolve the MSAA buffer to a non-MSAA buffer such as the back buffer.

Before I undertake this approach, I thought it would be a good idea to get feedback from the gurus here on this approach vs. any others that may come up. Here are a few questions:

1) Is it possible to write such a shader to convert the flat buffer to a hardware-compliant MSAA render target (meaning something the hardware can resolve to a non-MSAA buffer)? I'm not so sure this is possible, since the flat buffer contains only the sample colors and no coverage mask.

2) If this method isn't possible, what are my alternatives? Can a depth buffer be resolved with ID3D11DeviceContext::ResolveSubresource()? If so, then Method 1 becomes much easier. [EDIT]: I've confirmed that an MSAA depth buffer cannot be resolved to non-MSAA.
The main problem with compositing is that you can't support arbitrary blending modes for your transparents. You can implement alpha blending and additive blending this way, but you couldn't also use other blending modes like multiply or screen. You can't automatically resolve a depth buffer, but you can do it manually with a pixel shader. Just sampling the first subsample and outputting it to SV_Depth should work well enough. Obviously you don't get MSAA with your transparents if you go this route.
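A minimal sketch of that manual depth resolve, assuming a full-screen pass with depth writes enabled and the depth test set to ALWAYS (the resource name is illustrative):

[code]
Texture2DMS<float> MsaaDepth : register(t0);

float ResolveDepthPS(float4 position : SV_Position) : SV_Depth
{
    // Texture2DMS is addressed by integer texel coordinates plus a sample index;
    // sample 0 is good enough here since the transparents won't be multisampled anyway.
    return MsaaDepth.Load(int2(position.xy), 0);
}
[/code]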

To answer your first question, you can definitely write a pixel shader to convert from a structured buffer to an MSAA render target. To do it properly you'll need to run the pixel shader at per-sample frequency, which is done by taking SV_SampleIndex as an input to your shader. You can then use the pixel position + sample index to sample the proper value from the structured buffer, and then you just output it and it will get written to the appropriate subsample of the output texel. As far as D3D11 is concerned render targets only contain color data, not coverage. So you don't need to worry about that. There are exotic MSAA modes that decouple coverage and color (like Nvidia's CSAA), but you don't have direct access to that in D3D11 so you have to do it the standard way. As long as you still have your MSAA depth buffer, the transparent geometry will get rasterized and depth tested correctly.
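As a rough sketch of that per-sample conversion pass (the flat buffer's addressing below is an assumption; it has to match whatever layout the lighting compute shader actually wrote):

[code]
// Sketch only: names and the flat buffer's addressing scheme are assumptions.
StructuredBuffer<float4> LitSamples : register(t0);

cbuffer ResolveCB : register(b0)
{
    uint FramebufferWidth;
    uint SampleCount;
};

float4 FlatToMsaaPS(float4 position    : SV_Position,
                    uint   sampleIndex : SV_SampleIndex) : SV_Target0
{
    uint2 pixel = uint2(position.xy);
    // Assumes the lighting CS stored its results pixel-major, then per sample.
    uint address = (pixel.y * FramebufferWidth + pixel.x) * SampleCount + sampleIndex;
    return LitSamples[address];
}
[/code]

Declaring SV_SampleIndex as an input is what forces the shader to run once per sample rather than once per pixel, so each subsample of the MSAA render target receives its own value.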
Thanks for the incredibly helpful reply, MJP!


[quote]
...but you couldn't also use other blending modes like multiply or screen.
[/quote]

It's not clear to me why rendering translucent geo into a render target with the blend mode set to multiply wouldn't work.

[quote]
Just sampling the first subsample and outputting it to SV_Depth should work well enough. Obviously you don't get MSAA with your transparents if you go this route.
[/quote]

So I bind the depth buffer as an SRV and run the pixel shader at per-pixel frequency by not specifying SV_SampleIndex as an input to the shader? Then just read the depth texture and write it out to SV_Depth?

It sounds like this method (a depth buffer resolve shader) is the better choice for our application. We draw a lot of translucent particles like smoke, so rendering them into a non-MSAA buffer should mean less bandwidth. And since the particles tend to have smooth texture edges, MSAA probably wouldn't benefit us much.

