[DX11] Tile-based Deferred Shading in BF3 discussion

360GAMZ · 2012-03-16T00:50:34

DICE released this presentation that talks about how their renderer uses tile-based deferred shading with DX11: http://publications.dice.se/attachments/GDC11_DX11inBF3_Public.pptx

The tile-based approach starts on slide 10. On slide 12 they say they use 1 thread per pixel, and one 16x16 thread group per tile. To process the entire screen, I assume they use the ID3D11DeviceContext::Dispatch() parameters to spawn a bunch of those 16x16 thread groups. For example, for a resolution of 1360x768, they'd call Dispatch( 85, 48, 1 ). Does that sound about right?

On slide 13 they have each thread group determine the min/max depth for its 16x16 pixel screen tile. This is done through groupshared data and interlocked instructions.

Slide 15 describes how they perform culling of the light list vs. the screen-aligned bounding box established on slide 13. Instead of each thread in the 16x16 thread group processing a pixel, now each thread processes a light from the incoming light list and, if that light intersects the bounding box, that thread adds the light index to the groupshared list of lights. At the end of this phase, each thread group has a list of lights that potentially intersect the pixels in that tile.

Slide 15 handles only point lights. What if we wanted to handle both point and spot lights? Two ideas come to mind. One is to expand struct Light to include the additional parameters needed for spot lights. Another is to use two independent structures, one for point and the other for spot. In the first case, we continue to use a single for() loop and conditionally select which intersection test to use based on the light type. In the second case, we use two for() loops, the first processing the point lights and the second processing the spot lights. The second approach feels like it should be more efficient than the first due to coherency between the threads in the thread group.

Slide 16 switches back to processing pixels. Each thread iterates through the list of lights potentially intersecting its tile's bounding box and performs the lighting calculation for its pixel. This all makes sense. Is there further culling that should be performed at this stage? For example, would it be beneficial to test each pixel to determine whether it lies inside the spot light cone? Or is it probably better to simply use a clamp instruction?

One thing not mentioned in the presentation is how they make the initial unculled list of lights available to the Compute Shader, other than that they use a StructuredBuffer for the light data and a Constant Buffer for the number of lights. According to NVIDIA, if a Buffer is created as Dynamic, it resides in AGP memory all the time. You can lock it, update selective portions, and unlock it and yet nothing will get uploaded to the graphics card. When the shader reads from the buffer, only the needed data is uploaded at PCI speeds, but the entire buffer is never uploaded to video memory. In contrast, non-dynamic buffers reside in video memory. They can be updated with UpdateSubresource, in which case the updated data is copied to a temporary buffer in system memory and eventually uploaded to video memory before the shader needs it. The first method is slower for the graphics hardware (reading memory over PCI is slower than reading it from video memory), and the second method imposes more overhead on the CPU (from all that copying).

Since the unculled list of lights probably changes every frame, it's unclear which method would be faster. But it's easy to switch between the two methods, so once I get to that point, I'll try them both. My gut feel is that with so many threads accessing the light buffer, it's probably best to go with the UpdateSubresource method and have everything reside in video memory.
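
For reference, the Dispatch() arithmetic asked about above can be sketched in a few lines of C++. This is only an illustration, assuming 16x16 tiles; `GroupCount`, `TileGrid`, and `MakeTileGrid` are hypothetical helper names, not anything from the DICE slides. The division rounds up so that a partial tile at the right or bottom edge still gets a thread group.

```cpp
#include <cassert>
#include <cstdint>

// Tile edge length in pixels; assumed to match the 16x16 thread group.
const uint32_t kTileSize = 16;

// Hypothetical helper: number of thread groups needed to cover `pixels`,
// rounded up so a partial tile at the edge is still processed.
inline uint32_t GroupCount(uint32_t pixels, uint32_t tile = kTileSize) {
    assert(tile > 0);
    return (pixels + tile - 1) / tile;
}

// Hypothetical container for the X and Y arguments to Dispatch().
struct TileGrid {
    uint32_t x;
    uint32_t y;
};

inline TileGrid MakeTileGrid(uint32_t width, uint32_t height) {
    TileGrid grid = { GroupCount(width), GroupCount(height) };
    return grid;
}

// With a real ID3D11DeviceContext* the call would look like:
//   TileGrid g = MakeTileGrid(1360, 768);
//   context->Dispatch(g.x, g.y, 1);
```

For 1360x768 this gives 85 x 48 groups, matching the numbers above. A resolution such as 1920x1080 is not a multiple of 16 vertically, so the last row of groups covers pixels past the bottom of the screen and the shader would need to bounds-check those threads.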


Started by 360GAMZ December 15, 2011 07:22 PM

22 comments, last by olaolsson 12 years, 2 months ago

MJP

20,297

January 04, 2012 11:06 PM

It's not clear to me why rendering translucent geo into a render target with the blend mode set to multiply wouldn't work.

I'm sorry, I misunderstood your approach. Never mind that part about the blending modes.

So I bind the depth buffer as an SRV and run the pixel shader at per-pixel frequency by not specifying SV_SampleIndex as an input to the shader? Then, just simply read the depth texture and write it out to SV_Depth?

It sounds like this method (depth buffer resolve shader) is a better choice for our application. We draw a lot of translucent particles like smoke and so rendering that into a non-MSAA buffer sounds like less bandwidth. And since the particles tend to have smooth texture edges, MSAA probably wouldn't benefit us much.

Yup. In our engine at work we actually take this concept a step further and downsample the depth buffer to half-sized, so that we can render expensive things (volumetrics, really dense smoke, etc.) to a half-sized render target and save performance.
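
To make the half-size depth downsample concrete, here is a rough CPU-side sketch in C++. It is only an illustration under stated assumptions: the post does not say how the 2x2 footprint is reduced, so taking the maximum (the farthest sample when larger depth means farther away) is an assumed choice, the dimensions are assumed to be even, and `DownsampleDepthHalf` is a hypothetical name. On the GPU this would normally be a full-screen pixel shader pass writing SV_Depth.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Reduce a width x height depth buffer (row-major) to half resolution.
// ASSUMPTION: each output texel keeps the maximum of its 2x2 footprint;
// the original post does not specify the reduction operator.
// ASSUMPTION: width and height are even.
inline std::vector<float> DownsampleDepthHalf(const std::vector<float>& depth,
                                              std::size_t width,
                                              std::size_t height) {
    const std::size_t halfW = width / 2;
    const std::size_t halfH = height / 2;
    std::vector<float> out(halfW * halfH);
    for (std::size_t y = 0; y < halfH; ++y) {
        for (std::size_t x = 0; x < halfW; ++x) {
            // Fetch the four full-resolution samples under this texel.
            const float d00 = depth[(2 * y) * width + (2 * x)];
            const float d10 = depth[(2 * y) * width + (2 * x + 1)];
            const float d01 = depth[(2 * y + 1) * width + (2 * x)];
            const float d11 = depth[(2 * y + 1) * width + (2 * x + 1)];
            out[y * halfW + x] = std::max(std::max(d00, d10), std::max(d01, d11));
        }
    }
    return out;
}
```

Swapping `std::max` for `std::min` gives the opposite conservative choice; which one is appropriate depends on the depth convention and on what is being rendered into the half-size target.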


olaolsson

142

March 15, 2012 06:08 AM

Hi,

Just thought I'd point you towards a paper about tiled shading, and an associated OpenGL demo, by, *ahem*, myself. The paper is sadly paywalled by JGT, but I've put up a preprint on my web site, which is not hugely different from the published paper (it contains some bonus listings that were removed due to space restrictions). You may be able to access the published paper from a uni library or similar.

http://www.cse.chalm...d=tiled_shading

The main takeaways are a much more thorough performance evaluation and analysis, and the introduction of tiled forward shading (which enables easy handling of transparent geometry).

In relation to the discussion here, I go a different way from the others and do the tile intersection by first transforming the lights to screen space, and then testing the screen-space extents against each tile. On the CPU I do it in scan-line fashion, which is as efficient as it gets, but somewhat hard to do in parallel. Therefore the GPU version does a brute-force tiles-test-all-lights approach, much like others have done, but with a much cheaper AABB/AABB test (2D extents + depth range). This saves constructing/testing identical planes all over the place.
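
As a rough illustration of that AABB/AABB test (2D extents + depth range), here is a minimal C++ sketch of the brute-force tile-versus-all-lights loop. It is not the code from the paper or the demo; `ScreenBounds`, `Overlaps`, and `CullLightsForTile` are hypothetical names, and the lights are assumed to have already been projected to screen-space bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Screen-space bounds of a light or a tile: 2D pixel extents plus a
// depth range. Hypothetical layout, for illustration only.
struct ScreenBounds {
    float minX, minY, maxX, maxY;  // pixel extents
    float minZ, maxZ;              // depth range
};

// Two boxes overlap only if their intervals overlap on every axis.
inline bool Overlaps(const ScreenBounds& a, const ScreenBounds& b) {
    return a.minX <= b.maxX && a.maxX >= b.minX &&
           a.minY <= b.maxY && a.maxY >= b.minY &&
           a.minZ <= b.maxZ && a.maxZ >= b.minZ;
}

// Brute force: test every light against one tile and collect the indices
// of the lights whose bounds overlap the tile's bounds.
inline std::vector<uint32_t> CullLightsForTile(
        const ScreenBounds& tile, const std::vector<ScreenBounds>& lights) {
    std::vector<uint32_t> visible;
    for (std::size_t i = 0; i < lights.size(); ++i) {
        if (Overlaps(tile, lights[i])) {
            visible.push_back(static_cast<uint32_t>(i));
        }
    }
    return visible;
}
```

The test is six comparisons per light and needs no per-tile plane setup, which is the saving described above; the cost is that a screen-space box is a looser fit for a spherical light than the sphere itself.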

The demo only implements the CPU variety, and without depth range (though I may update that).

Hope you find this useful.

Cheers
.ola

Hodgman

52,718

March 15, 2012 07:30 AM

I am working on a deferred pipeline for PC. Since the tile-based technique has been implemented on the X360, can anyone tell me the advantages and disadvantages of tile-based over quad-based deferred in DirectX 10?

I haven't used it to optimise my deferred shading yet (I'm planning on it and have high hopes), but applying the same tile-based optimisations to shadow-filtering, DOF, SSAO and FXAA has been a huge win for me on DX9-PC and the 360/PS3.

. 22 Racing Series .

olaolsson

142

March 16, 2012 12:50 AM

Mmm, interesting. I'm going to implement a light volume technique first (I understand it better), and then I will try to implement the tile-based one to see the performance difference.

Thanks for the answers!

So, to underline the main difference: traditional deferred shading is typically memory bound, whereas tiled deferred shading completely eliminates this bottleneck and is squarely compute bound. Given this, you can get an idea of how much better it will perform on your platform, either by looking at performance numbers, or by simple experimentation (e.g. vary the G-buffer bit depth). Both the Xbox 360 and PS3 have a very high compute-to-bandwidth ratio, and this is true for modern GPUs as well, and increasingly so.

As I found in my experiments, going between GTX 280 and GTX 480, shading performance doubles for tiled deferred, whereas my implementation of traditional deferred shading scales by the expected 30%, corresponding to the increase in memory bandwidth.

Anyway, of course, if you have massively complex shaders you may not be memory bandwidth bound (yet), but it's a pretty safe bet you will be sooner or later as memory bandwidth falls further and further behind. If rumours about the GTX 680 are to be believed, we'll see this gap widen significantly again in this new generation.

Cheers
.ola
