Horrible performance or not?

Started by
9 comments, last by bronxbomber92 11 years, 8 months ago
I've implemented a rather basic deferred renderer in DirectX11 but seem to be getting horrible performance even in the clearing stage.

I've got 4 GBuffer targets that I'm rendering to. I feed them into a clear shader, which basically just sets everything to 0 (or 1.0f for depth), and this pass alone already takes almost 1 millisecond. When I add the GBuffer pass, it doubles to almost 2ms for just a single textured plane.

I just can't believe this is normal, or is it?
I know my graphics card might not be the best out there, but 2ms just for that? Could 4 targets be too much of a strain on the memory bandwidth or something?

My hardware:
Core i5 2.8 GHz
Radeon 5750M (mobile) 1 GB
4 GB RAM

And the sample is running in 1280x720 without any AA solution.
Are you measuring only the time it takes the GPU to perform its part of the job, or the time it takes to issue the commands as well?

If you are measuring the full frame time you should make sure you are doing so with release builds.
I don't suffer from insanity, I'm enjoying every minute of it.
The voices in my head may not be real, but they have some good ideas!
I'm measuring the full frame time, and the release build does not make it any faster : /
I realize you can't point me to the exact problem; I just need to know if there's something not right here. I just don't get how rendering a fullscreen quad and just outputting 0.0f or 1.0f to 4 RTs increases the frame time THAT much.

Looking down on a single textured plane with a directional light, I'm getting around 390fps (so about 2.5ms frame time) at 1280x720. That seems like a little too much to me.

Is there a way to explore the performance impact further with PIX or something?
Seems fairly reasonable to me. What is your graphics card texture bandwidth and fill rate?

4 32-bit g-buffers at 1280x720 takes up about 15MB. So roughly 15 million bytes that you are writing to. Then you're sampling from those, and writing to another 4 million bytes or so.

Assuming texture sample bandwidth and pixel fill rate are roughly equal (they probably aren't), that's about 35 million bytes in 2.5ms.

Or roughly 14GB/s (or 112 Gbps). Does that correspond to your graphics card's specs?
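The estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch: the function names are made up for illustration, and the traffic model (write the g-buffer once, read it back once in the lighting pass, write one 32-bit output target) is an assumption taken from the reasoning in this post, not a measurement.

```python
def render_target_bytes(width, height, bytes_per_pixel=4):
    """Size in bytes of one render target (default: 32 bits per pixel)."""
    return width * height * bytes_per_pixel


def estimated_bandwidth_gb_per_s(total_bytes, frame_time_ms):
    """Average bandwidth in GB/s (1 GB = 1e9 bytes) for moving
    total_bytes within frame_time_ms milliseconds."""
    return total_bytes / (frame_time_ms / 1000.0) / 1e9


one_target = render_target_bytes(1280, 720)   # one 32-bit target
gbuffer = 4 * one_target                      # four g-buffer targets

# Assumed traffic per frame: g-buffer written once, sampled once by the
# lighting pass, plus one 32-bit target written as output.
traffic = gbuffer + gbuffer + one_target

bandwidth = estimated_bandwidth_gb_per_s(traffic, 2.5)
print(f"{traffic / 1e6:.1f} MB per frame, {bandwidth:.1f} GB/s")
```

Without rounding each term up, this comes out at roughly 33 MB per frame and a little over 13 GB/s, which is in the same ballpark as the 35 MB and 14 GB/s quoted above.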
You should never clear the whole GBuffer. Simply clearing the depth should be enough.

That is the first time I've heard that, can you elaborate why ?


@phil_t
These are the stats according to AMD's site:

Engine clock speed: 550 MHz
Processing power (single precision): 440 GigaFLOPS
Polygon throughput: 550M polygons/sec
Data fetch rate (32-bit): 44 billion fetches/sec
Texel fill rate (bilinear filtered): 11 Gigatexels/sec
Pixel fill rate: 4.4 Gigapixels/sec
Anti-aliased pixel fill rate: 17.6 Gigasamples/sec
Memory clock speed: 800 MHz GDDR5
Memory data rate: 3.2 Gbps GDDR5
Memory bandwidth: 51.2 GB/sec
TDP: 25 Watts

4 32-bit g-buffers at 1280x720 takes up about 15MB. So roughly 15 million bytes that you are writing to.
QFE - phil_t's on the money here. Doing this in 1ms indicates a frame-buffer write bandwidth of about 14GiB/s, which isn't the highest I've seen, but might be typical for a "mobile" version of a card.
If you could profile your GPU in depth, you'd probably find that this operation is entirely ROP bound (frame-buffer write operations), so looking at the theoretical fill-rate of the card will give you a "speed of light" value (theoretical limit).
[edit]Apparently your specific card has a theoretical max of 4.4 billion pixel writes (probably 32-bit ones) per second, so in theory, your 1280*720*4 buffer should take at least ~0.84ms. What's your actual measured "almost 1ms" value?
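As a rough sketch of that "speed of light" calculation, the snippet below derives two lower bounds for clearing four 1280x720 targets from the spec list above: one from pixel fill rate and one from memory bandwidth. The function names are illustrative, and treating every pixel write as a 32-bit write is an assumption.

```python
def min_time_ms_fill_rate(pixel_writes, pixels_per_second):
    """Lower bound in ms if the pass is limited by ROP pixel fill rate."""
    return pixel_writes / pixels_per_second * 1000.0


def min_time_ms_bandwidth(bytes_written, bytes_per_second):
    """Lower bound in ms if the pass is limited by memory bandwidth."""
    return bytes_written / bytes_per_second * 1000.0


PIXEL_FILL_RATE = 4.4e9     # pixels/s, from the spec list
MEMORY_BANDWIDTH = 51.2e9   # bytes/s, from the spec list

pixel_writes = 1280 * 720 * 4      # one write per pixel, four targets
bytes_written = pixel_writes * 4   # assumption: 32-bit (4-byte) pixels

rop_bound_ms = min_time_ms_fill_rate(pixel_writes, PIXEL_FILL_RATE)
mem_bound_ms = min_time_ms_bandwidth(bytes_written, MEMORY_BANDWIDTH)

print(f"fill-rate bound: {rop_bound_ms:.2f} ms")
print(f"bandwidth bound: {mem_bound_ms:.2f} ms")
```

Under these assumptions the fill-rate bound (about 0.84ms) is well above the bandwidth bound (about 0.29ms), which is consistent with the pass being ROP bound rather than limited by raw memory bandwidth.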
That is the first time I've heard that, can you elaborate why ?
There's no point clearing any buffer that you're going to overwrite the contents of later on. Assuming that geometry always fills your entire screen, then new geometry is going to fill your g-buffer anyway, so clearing it is a waste of time.
Alright, to give slightly more accurate results, here is what I've measured.

0.33ms for drawing nothing
0.53ms for clear depth RT only
1.28ms for clear & GBuffer
1.65ms for clear, GBuffer and Lighting
1.92ms for clear, GBuffer, Lighting and Compose

Clear Pass = 0.2ms
GBuffer Pass = 1.08ms
Lighting Pass = 0.37ms
Compose Pass = 0.27ms

Writing only to the GBuffer depth target in the clear pass made it a bit faster.
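One way to turn the cumulative timings above into per-pass costs is to take successive differences. This is a minimal sketch with made-up names, assuming each measurement adds exactly one pass on top of the previous one.

```python
# Cumulative frame times in ms, as listed in the post.
cumulative_ms = [
    ("nothing", 0.33),
    ("clear", 0.53),
    ("gbuffer", 1.28),
    ("lighting", 1.65),
    ("compose", 1.92),
]


def pass_costs(cumulative):
    """Return {pass_name: cost_ms}, where each cost is the increase
    over the previous cumulative measurement."""
    costs = {}
    for (_, prev), (name, curr) in zip(cumulative, cumulative[1:]):
        costs[name] = round(curr - prev, 2)
    return costs


for name, cost in pass_costs(cumulative_ms).items():
    print(f"{name}: {cost:.2f} ms")
```

Computed this way, the clear, lighting and compose costs match the figures listed (0.2, 0.37 and 0.27ms), while the GBuffer step comes out at 0.75ms. The listed 1.08ms equals 0.75ms plus the 0.33ms "drawing nothing" baseline, so it may include that baseline; that reading is a guess.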

There's no point clearing any buffer that you're going to overwrite the contents of later on. Assuming that geometry always fills your entire screen, then new geometry is going to fill your g-buffer anyway, so clearing it is a waste of time.


In this case and in general this is true.

However, there is a point to clearing the render targets when running on an SLI/Crossfire setup. The clear is one of the ways the driver can recognize which surfaces aren't needed by the other GPU, so it may skip transferring the framebuffer between the GPUs' memories. So keep your clear code around for the case where there is more than one GPU.

Otherwise, you may save some bandwidth if you reconstruct position from the hardware z-buffer instead of using another buffer for depth. The quality isn't as good, but it should be enough for typical scenarios.
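To illustrate the idea of getting position back from the z-buffer, here is a small sketch of the math. It assumes a D3D-style left-handed perspective projection with depth in [0, 1]; the function and parameter names are made up, and in a real renderer this would live in the lighting shader rather than on the CPU.

```python
import math


def project(view_pos, fov_y, aspect, near, far):
    """Project a view-space point to (ndc_x, ndc_y, depth), assuming a
    D3D-style left-handed perspective projection, depth in [0, 1]."""
    x, y, z = view_pos
    y_scale = 1.0 / math.tan(fov_y / 2.0)
    x_scale = y_scale / aspect
    depth = far * (z - near) / ((far - near) * z)
    return (x * x_scale / z, y * y_scale / z, depth)


def reconstruct_view_pos(ndc_x, ndc_y, depth, fov_y, aspect, near, far):
    """Invert project(): recover the view-space position from the
    pixel's NDC coordinates and its stored hardware depth value."""
    y_scale = 1.0 / math.tan(fov_y / 2.0)
    x_scale = y_scale / aspect
    # Undo the non-linear depth mapping to get linear view-space z.
    z = near * far / (far - depth * (far - near))
    return (ndc_x * z / x_scale, ndc_y * z / y_scale, z)


camera = dict(fov_y=math.radians(60.0), aspect=16.0 / 9.0, near=0.1, far=100.0)
original = (1.5, -0.5, 10.0)
ndc_x, ndc_y, depth = project(original, **camera)
recovered = reconstruct_view_pos(ndc_x, ndc_y, depth, **camera)
print(original, recovered)
```

The stored depth is strongly non-linear (most of the [0, 1] range is spent close to the near plane), which is one reason positions reconstructed this way can be less precise far from the camera.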

Best regards!
Good points Kauna.

In my engine, I force the user to be explicit about their usage of a render-target when they bind one -- the bind API forces them to choose some enum values, which boil down to:
1) When I bind this target, I need the previous contents to remain intact.
2) When I bind this target, I need it to be cleared to a specific value.
3) I'm going to overwrite every pixel in this target, so I don't care what its initial values are upon binding.
4) I don't care whether any values are actually written to this target or not.
5) When I'm finished with this target, I need it to be cloned into this texture.

I haven't done it yet, but I should apply your advice to case #3 -- if someone binds a target in this mode, I usually avoid clearing (actually I clear to a random colour in non-shipping builds only, to test that their choice is valid), but I should issue a clear command if they are using an SLI setup.

#4 is used, for example, when you're rendering depth only, but the underlying API requires you to bind a colour target anyway (e.g. GLES). This can be used to tell the driver not to 'resolve' the colour target, even though it's bound.

#5 is used when you want 2 or more copies of the rendered data. On some APIs, it can be done quicker using 2 resolve commands, instead of 1 resolve + 1 copy.
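The bind-time decision for cases 1 to 3 could be sketched roughly like this. All names here are hypothetical and not taken from the engine described in this post, and giving the multi-GPU clear priority over the debug clear is an assumption.

```python
from enum import Enum, auto


class BindUsage(Enum):
    PRESERVE = auto()        # case 1: previous contents must stay intact
    CLEAR = auto()           # case 2: clear to a specific value
    OVERWRITE_ALL = auto()   # case 3: every pixel will be overwritten


class BindAction(Enum):
    NONE = auto()
    CLEAR_TO_VALUE = auto()
    CLEAR_FOR_MULTI_GPU = auto()
    DEBUG_CLEAR_RANDOM = auto()


def action_on_bind(usage, multi_gpu, shipping_build):
    """Decide what to do when a render target is bound."""
    if usage is BindUsage.PRESERVE:
        return BindAction.NONE
    if usage is BindUsage.CLEAR:
        return BindAction.CLEAR_TO_VALUE
    # OVERWRITE_ALL: the clear is redundant on a single GPU, but on
    # SLI/Crossfire it hints to the driver that the old contents are dead.
    if multi_gpu:
        return BindAction.CLEAR_FOR_MULTI_GPU
    # In non-shipping builds, fill with a random colour so that a wrong
    # "I overwrite everything" claim shows up visibly.
    if not shipping_build:
        return BindAction.DEBUG_CLEAR_RANDOM
    return BindAction.NONE


print(action_on_bind(BindUsage.OVERWRITE_ALL, multi_gpu=False, shipping_build=True))
```

Cases 4 and 5 are about what happens when rendering to the target ends rather than when it is bound, so they are left out of this sketch.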
