[XNA] Deferred Rendering - Performance issues

Started by
11 comments, last by solenoidz 14 years, 1 month ago
As the topic suggests, I'm trying to write a deferred renderer in XNA. Luckily I found a tutorial that explains pretty much everything you need, and even provides a working sample: http://www.catalinzima.com/?page_id=14

I noticed that the provided sample had some serious graphical glitches though, which seemed to be related to depth testing. So I decided to write my deferred renderer from scratch using the information provided in this tutorial, and it ended up working just fine.

Unfortunately the performance seems to be rather bad. Using a single low-poly model and 20 point lights at a resolution of 1600x900, I end up dropping below 60fps on a 9800gtx when I move my camera close to the model (more screen pixels are lit), which I think is unacceptable. I understand that due to the nature of deferred rendering, a very simple scene will still use a considerable amount of GPU time, but with "only" 20 point lights, which should be easy for a deferred renderer to handle, this seems excessive. Thinking that I had made a mistake, I took catalinzima's sample, changed it to run at 1600x900, and it performs just as badly as mine.

I obviously want to solve this problem before adding more features to the renderer, such as shadowing. My main problem is that I don't know how to search for bottlenecks in this situation. I tried running it using nvidia PerfHUD: the result is kind of obvious, I'm maxing out my pixel shaders, but in deferred rendering nearly all of the work is done by pixel shaders, so this doesn't really help. For some reason normal mapping is glitchy when running PerfHUD, as you can see in the picture. It's working fine when I run it normally. Is this a bug with PerfHUD, or could this indicate a problem with my normal mapping?

I don't really know how to proceed, so I'm open to any suggestions. I can share my current project if it'll be of any help, but the code will be nearly identical to catalinzima's.

Thanks,
Hyu
Deferred shading has some overhead by itself, and it's pretty much constant and independent of scene complexity, so you might just be experiencing this. Try complicating your scene, with more geometry and/or more lights, to see if there is any difference.
Also try running it with only one light. If there's no difference either, I'd guess that running at a lower resolution like 640x480 will result in a big jump in performance.
In most cases a deferred renderer will be bound by the number of pixels you end up shading in your lighting pass. This might seem obvious, but what isn't obvious is that there's not necessarily a direct relationship between the number of lights and the number of pixels shaded. Two reasons for this:

1. The number of pixels shaded for a light depends on how big it is, and how close to the camera it is. If all of your lights are large, you could end up with 20 full-screen passes for 20 lights.

2. Typically for a deferred renderer you use optimizations to reduce the number of pixels shaded for any particular light. This includes...

-Using bounding volumes to restrict the shaded pixels to the area where the light is actually affecting pixels
-Using depth testing to reject pixels that are "buried under ground" or "floating in mid-air"
-Using stencil testing to reject pixels where no G-Buffer data has been rendered, or to enhance the depth test
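
Point 1 above can be made concrete with a back-of-the-envelope estimate. The sketch below (Python standing in for CPU-side engine code; the function name and numbers are illustrative) computes roughly how many screen pixels a light's bounding sphere covers, which is what the cost of its lighting pass scales with:

```python
import math

def light_pixel_cost(radius, view_distance, fov_y_deg, screen_h):
    """Rough estimate of how many pixels a point light's bounding
    sphere covers on screen (perspective projection, sphere fully in
    view). Illustrates why lights near the camera dominate the cost."""
    if view_distance <= radius:
        # Camera is inside the light volume: full-screen pass.
        return None  # caller treats this as "entire screen"
    # Angular radius of the sphere, converted to screen pixels.
    ang = math.asin(radius / view_distance)
    pixels_per_radian = screen_h / math.radians(fov_y_deg)
    r_px = ang * pixels_per_radian
    return math.pi * r_px * r_px  # area of the projected disc

# The same light of radius 5, far away vs close to the camera:
far_cost = light_pixel_cost(5.0, 100.0, 60.0, 900)
near_cost = light_pixel_cost(5.0, 10.0, 60.0, 900)
```

Moving the camera from distance 100 to distance 10 multiplies the shaded area by roughly two orders of magnitude, which matches the framerate drop you're seeing when approaching the model.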
Thanks for the replies :)

I understand that deferred shading has quite an overhead by itself, especially if you crank up the resolution.
However, I was a bit surprised by how high this initial overhead is.

Quote:1. The number of pixels shaded for a light depends on how big it is, and how close to the camera it is. If all of your lights are large, you could end up with 20 full-screen passes for 20 lights.

Yes, this is the reason the framerate drops dramatically once I approach the ship with my camera.
I'm probably ending up "inside" 5 point lights, causing 5 full-screen passes, plus some "near-fullscreen" passes due to lights covering nearly the full screen.

Quote:-Using bounding volumes to restrict the shaded pixels to the area where the light is actually affecting pixels

I'm currently using simple sphere models to limit the shaded pixels for my point lights.

Quote:-Using depth testing to reject pixels that are "buried under ground" or "floating in mid-air"

Would you be so kind to elaborate a bit on this?
The G-Buffer only contains one "depth level" so I'm not sure what you mean.
I can't light a pixel below the floor, because I have no info of what is below the floor (which is the cause of the transparency issue with deferred shading).

Quote:-Using stencil testing to reject pixels where no G-Buffer data has been rendered, or to enhance the depth test

Say I'm outdoors, looking up into a skybox (which has no geometry), I'm technically looking into the void, and should reject those pixels, because any calculations would be useless.
Is this what you mean?



So in order to fix my performance problem, the most important thing would be light positioning?
If lights are far away, I could place 100 or so, because they're only going to light a few pixels anyway, but I need to avoid positioning lights in a way that lets a few of them cover big portions of the screen at the same time.


I also tried to mess around with my G-Buffer a bit, in order to get a bit more performance out of it.
Right now I'm using:

Color:   Red 8    Green 8    Blue 8    Specular Intensity 8
Normal:  X 16     Y 16       Z 16      Specular Power 16
Depth:   Depth 32
Light:   Red 16   Green 16   Blue 16   Specular 16
Final:   Red 16   Green 16   Blue 16


I'm pretty sure I can get the normals down a bit, but 32bit rgba won't cut it.
The light and final targets are 64bit rendertargets so I can easily add an HDR effect later on.

I tried reducing it to the following, to see how big the impact would be:

Color:   Red 8    Green 8    Blue 8    Specular Intensity 8
Normal:  X 8      Y 8        Z 8       Specular Power 8
Depth:   Depth 32
Light:   Red 8    Green 8    Blue 8    Specular 8


Interestingly enough, I did not notice any difference in performance at all (except for GPU memory usage, obviously), which I think is quite odd.
Do I assume correctly that bigger rendertargets only influence bandwidth, but not the actual shader processing time?

And yet another question :)
As you can see I only save the depth information in my G-Buffer, and reconstruct the Positions in shaders.
Since this involves Matrix multiplications, could this cause a big performance impact?
Would it be better to save the full Position data into the G-Buffer instead of only the depth, using a 128bit Rendertarget?
I'm a bit worried because that's quite big, and I waste one 32bit channel.
I'm also not sure how well 128bit Rendertargets are actually supported on graphic cards.



Hyu
Consider taking Naughty Dog's (and some others) approach to lights and render using screen-space tiles instead of world-space spheres, and doing multiple light calculations in one pass of the shader. This requires less fetching of the G-buffer and depth buffer, and less writing to the render target. Also, if you, say, do 4 point lights in parallel, you can try rotating the calculations to a structure-of-arrays approach, which may help you cut down your shader cycles


I see an artifact in your specular that I've seen in the past from non-normalized normals. Double-check that you're normalizing them properly
Quote:Original post by Hyunkel
Would you be so kind to elaborate a bit on this?
The G-Buffer only contains one "depth level" so I'm not sure what you mean.
I can't light a pixel below the floor, because I have no info of what is below the floor (which is the cause of the transparency issue with deferred shading).


When you're rendering the lights you're only interested in geometry that intersects the bounding volume you're using for the light. In other words you want pixels that are closer than the backfaces of your bounding volume, and further away than the frontfaces of your bounding volume.

Now with standard depth testing we only get one depth test per pass, so we can't check that both of the above conditions are met in a single pass. However we can certainly check one of them. Which one you use depends on whether you're drawing the frontfaces or the backfaces of your bounding volume. If you're drawing the frontfaces (backface culling), then you set your DepthBufferFunction to LessEqual to reject all geometry that's in front of the frontfaces of your volume (AKA the portions of your light that are "buried underground"). If you're drawing the backfaces, then set your DepthBufferFunction to GreaterEqual to reject all geometry that's behind the backfaces of your bounding volume (AKA the portions of your light that are "floating in air"). You can do both tests if you perform a prepass that outputs a result to the stencil buffer.
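
A sketch of that face/comparison choice, derived from first principles (illustrative Python, not an XNA API; it assumes the usual convention that the depth test passes when the comparison of the incoming fragment's depth against the stored scene depth succeeds):

```python
def light_volume_depth_state(camera_pos, light_pos, radius, near_plane=0.1):
    """Choose which faces of a point light's bounding sphere to draw,
    and which depth comparison to use. Names are illustrative, not an
    XNA API. A scene pixel should be lit only if its depth lies
    between the volume's frontface depth and backface depth."""
    dx = camera_pos[0] - light_pos[0]
    dy = camera_pos[1] - light_pos[1]
    dz = camera_pos[2] - light_pos[2]
    limit = radius + near_plane
    if dx * dx + dy * dy + dz * dz <= limit * limit:
        # Camera inside the volume (or the near plane would clip the
        # frontfaces): draw the backfaces and keep only scene pixels
        # in front of them, rejecting the "floating in air" case.
        return ("draw_backfaces", "greater_equal")
    # Camera outside: draw the frontfaces and keep only scene pixels
    # behind them, rejecting the "buried underground" case.
    return ("draw_frontfaces", "less_equal")
```

The inside/outside check matters because once the camera enters the light volume, its frontfaces are behind the eye or clipped by the near plane, so only the backface pass is available.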

Quote:Original post by Hyunkel
Say I'm outdoors, looking up into a skybox (which has no geometry), I'm technically looking into the void, and should reject those pixels, because any calculations would be useless.
Is this what you mean?


Yup, that's exactly what I mean. This is pretty easy to do...just clear your stencil buffer to 0 and then set the stencil to 1 everywhere you render to your G-Buffer. Then when you render lights, just check that the stencil is greater than 0.

Quote:Original post by Hyunkel
So in order to fix my performance problem, the most important thing would be light positioning?
If lights are far away, I could place 100 or so, because they're only going to light a few pixels anyway, but I need to avoid positioning lights in a way that lets a few of them cover big portions of the screen at the same time.


Yeah, in general you'll need to keep that in mind when positioning lights, so that you keep the performance cost per light down.


Quote:Original post by Hyunkel
I also tried to mess around with my G-Buffer a bit, in order to get a bit more performance out of it.
Right now I'm using:

Color:   Red 8    Green 8    Blue 8    Specular Intensity 8
Normal:  X 16     Y 16       Z 16      Specular Power 16
Depth:   Depth 32
Light:   Red 16   Green 16   Blue 16   Specular 16
Final:   Red 16   Green 16   Blue 16


I'm pretty sure I can get the normals down a bit, but 32bit rgba won't cut it.
The light and final targets are 64bit rendertargets so I can easily add an HDR effect later on.

I tried reducing it to the following, to see how big the impact would be:

Color:   Red 8    Green 8    Blue 8    Specular Intensity 8
Normal:  X 8      Y 8        Z 8       Specular Power 8
Depth:   Depth 32
Light:   Red 8    Green 8    Blue 8    Specular 8


Interestingly enough, I did not notice any difference in performance at all (except for GPU memory usage, obviously), which I think is quite odd.
Do I assume correctly that bigger rendertargets only influence bandwidth, but not the actual shader processing time?


Changing your render target formats will definitely affect the bandwidth used to render the G-Buffer, and the bandwidth used to sample it. It could also potentially affect how many cycles it takes the ROPs to output G-Buffer pixels, and how many cycles it takes the texture units to sample from the G-Buffer textures. If you're not bound by any of these things, then the format won't matter. If you go into the Frame Profiler in PerfHUD, it will show you what's bottlenecking your performance for a particular frame, draw call, or portion of the frame. To get stats for a portion of the frame, you need to use PIX perf markers. The tutorial in my sig explains how those work, and has a helper library that makes them easy to use from XNA.

BTW the Nvidia 6- and 7-series GPUs don't allow you to do MRT with render targets that have different bit depths, so they all have to be 32bpp or all 64bpp. This means your first setup wouldn't work on those cards. Also, for normals there are several different methods for encoding them into two components instead of three. There's a very detailed overview here.
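
For illustration, one popular two-component encoding (likely among those covered in such overviews) is the spheremap transform. A minimal sketch in Python, though in practice both halves would live in HLSL:

```python
import math

def encode_spheremap(nx, ny, nz):
    """Pack a unit normal into two components via the spheremap
    transform, freeing up a G-Buffer channel."""
    f = math.sqrt(8.0 * nz + 8.0)  # undefined only for nz == -1
    return (nx / f + 0.5, ny / f + 0.5)

def decode_spheremap(ex, ey):
    """Recover the unit normal from the two stored components."""
    fx, fy = ex * 4.0 - 2.0, ey * 4.0 - 2.0
    f = fx * fx + fy * fy
    g = math.sqrt(1.0 - f / 4.0)
    return (fx * g, fy * g, 1.0 - f / 2.0)
```

The encoding only breaks down for a normal pointing exactly away from the viewer (nz == -1), which view-space normals of visible surfaces never quite reach.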

Quote:Original post by Hyunkel
As you can see I only save the depth information in my G-Buffer, and reconstruct the Positions in shaders.
Since this involves Matrix multiplications, could this cause a big performance impact?


Like the render target formats, it will only affect your performance if you're bound by ALUs. With the PC you never know what hardware your user will be running, so it helps to minimize usage of any particular GPU resource whenever you can.

If you use a linear depth buffer, you can reconstruct position with a single MADD rather than a full matrix multiply. See this.
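
As a sketch of the idea (Python standing in for shader code; assume `frustum_ray` is the interpolated direction from the eye through the pixel to the far plane, passed down from the vertex shader):

```python
def store_linear_depth(view_z, far_plane):
    """What the G-Buffer pass would write: view-space Z scaled to [0,1]."""
    return view_z / far_plane

def reconstruct_view_pos(frustum_ray, linear_depth):
    """What the lighting pass does: scale the interpolated eye-to-far-
    plane ray by the stored depth. That's one multiply per component
    (a single MADD once a camera-position add is folded in), instead
    of a full inverse view-projection matrix multiply per pixel."""
    return (frustum_ray[0] * linear_depth,
            frustum_ray[1] * linear_depth,
            frustum_ray[2] * linear_depth)
```

Because the ray already encodes the pixel's direction, no per-pixel matrix math is needed at all in the lighting pass.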

Quote:Original post by Hyunkel
Would it be better to save the full Position data into the G-Buffer instead of only the depth, using a 128bit Rendertarget?
I'm a bit worried because that's quite big, and I waste one 32bit channel.
I'm also not sure how well 128bit Rendertargets are actually supported on graphic cards.


Don't even bother. Go for the depth buffer approach.

Hey MJP,
Thank you for taking the time to answer all of my questions! :)
Your explanations make perfect sense and are going to help me a lot in optimizing my deferred renderer.
The links and tips you provided are especially helpful!
I wouldn't have thought about searching for a way to pack normals into two components, and until today I hadn't heard of PIX.


Quote:Consider taking Naughty Dog's (and some others) approach to lights and render using screen-space tiles instead of world-space spheres, and doing multiple light calculations in one pass of the shader. This requires less fetching of the G-buffer and depth buffer, and less writing to the render target. Also, if you, say, do 4 point lights in parallel, you can try rotating the calculations to a structure-of-arrays approach, which may help you cut down your shader cycles


I see an artifact in your specular that I've seen in the past from non-normalized normals. Double-check that you're normalizing them properly


The screen-space tiles approach seems quite straightforward.
I'm not so sure about how to render point lights in parallel though.
I can send the data for 4 lights to a modified point light shader, but how do I make 4 spheres (or screen-space tiles) render in only one pass, and if I manage to do that, how do I know which light I'm currently rendering in the shader?

And yes, I need to get those normals checked :)

Thanks,
Hyu
Quote:Original post by Hyunkel

The screen-space tiles approach seems quite straightforward.
I'm not so sure about how to render point lights in parallel though.
I can send the data for 4 lights to a modified point light shader, but how do I make 4 spheres (or screen-space tiles) render in only one pass, and if I manage to do that, how do I know which light I'm currently rendering in the shader?


There's some CPU work to be done. Namely, projecting the spheres for your point lights into screen space and intersecting them with tiles. If you have a tile that intersects with 4 point lights, you create a shader that takes 4 positions and 4 colors as shader constants, and calculates the lighting for each light and combines them together.

You then have some combination of light shaders (you might have one that does 2 point lights, 3 point lights, 1 spot and 1 point, 4 spots and 4 points, whatever).
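
The CPU-side binning described above might look something like the following sketch (Python for clarity; the function name and tile size are illustrative). Each light's bounding sphere, already projected to a screen-space circle, is recorded against every tile its bounding rectangle overlaps, so each tile can then be drawn once with a shader variant matching its light count:

```python
def bin_lights_into_tiles(lights, screen_w, screen_h, tile_size=32):
    """lights: list of (screen_x, screen_y, screen_radius) tuples, the
    light's bounding sphere already projected to pixel coordinates.
    Returns {(tile_x, tile_y): [light indices]}."""
    tiles = {}
    for i, (sx, sy, r) in enumerate(lights):
        # Clamp the light's screen-space rectangle to the tile grid.
        x0 = max(int((sx - r) // tile_size), 0)
        y0 = max(int((sy - r) // tile_size), 0)
        x1 = min(int((sx + r) // tile_size), (screen_w - 1) // tile_size)
        y1 = min(int((sy + r) // tile_size), (screen_h - 1) // tile_size)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                tiles.setdefault((tx, ty), []).append(i)
    return tiles
```

Note this uses the circle's bounding rectangle rather than an exact circle-rectangle intersection test, which over-counts slightly at the corners but keeps the CPU cost trivial.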
Quote:Original post by Hyunkel
and until today, I haven't heard of PIX.


Yeah, unfortunately PIX is something that has very low visibility unless you're doing native DX development, but it's still incredibly useful if you're using XNA or SlimDX. I literally use PIX every day of the week at work, and I can't recommend enough that you get to know it if you're doing any sort of serious graphics development. It can save you enormous amounts of time when it comes to debugging graphics problems.

Quote:Original post by Hyunkel
The screen-space tiles approach seems quite straightforward.
I'm not so sure about how to render point lights in parallel though.
I can send the data for 4 lights to a modified point light shader, but how do I make 4 spheres (or screen-space tiles) render in only one pass, and if I manage to do that, how do I know which light I'm currently rendering in the shader?


You wouldn't make 4 spheres, you would instead make a quad or a box that covers the tile. If you want, you can find the min and max for X, Y, and Z over all light spheres within a tile and then use that to create a box that you render for all lights in that tile. If you did that you could still use the depth test to reject pixels, however the depth test would be more coarse than if you drew the bounding spheres individually. Which approach is better depends on what your bottleneck is: drawing lights individually will usually help you save on ALU usage, while going for tiles will help you save more on bandwidth. Using the tiled approach can also help if you can't use hardware blending for your light render target, either because it's too slow or because you're using an HDR encoding format (like LogLuv or RGBM).

[Edited by - MJP on February 22, 2010 3:26:58 AM]

