Measuring performance - overdraw

17 comments, last by maxest 12 years ago
Hey guys,

I have some problems with measuring performance, mostly with understanding overdraw rendering times.


Let me first describe what I have in the scene. I have a plane mesh composed of a number of triangles (around 512). The camera looks straight down at this plane so that it fills the whole screen. Rendering it takes 190ns, of which 1ns is vertex processing - I determined that by rendering the plane in such a way that the camera does not see it. So roughly speaking, pixel processing takes 189ns.

Now I added a "grass" mesh, which makes the whole scene look like the screenshot grass.jpg (green is the "ground plane", and the fancy-colored polygons are part of the "grass" mesh; no alpha test or alpha blending, top-down view of the scene). With the grass rendered, I got new timings:

plane mesh - 54ns (it now takes less space on the screen)
grass mesh - 370ns (260ns for vertex processing, so 110ns for pixel processing)

Now, since the amount of pixel work is exactly the same (in this scene I did a depth pre-pass, and the timings are from the actual shading pass with the depth test set to EQUAL), how is it that the 189ns from the first test case (only the plane mesh filling the screen) differs from the 54ns + 110ns = 164ns of pure pixel processing in the second test case? The amount of pixel processing should be exactly the same in both cases, right? I could even expect the second test case to be slower, because there are more pixels that need to be tested against the z-buffer.
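(For reference, a minimal sketch of how this kind of pre-pass + EQUAL shading pass is set up in GL; drawScene() is just a placeholder for submitting the plane and grass meshes:)

[code]
// Pass 1: depth pre-pass - lay down depth only, no colour writes.
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);
glDepthMask(GL_TRUE);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
drawScene();                      // cheap (ideally empty) pixel shader here

// Pass 2: shading pass - with the depth buffer already final, only the
// front-most fragment of each pixel should pass the EQUAL test.
glDepthFunc(GL_EQUAL);
glDepthMask(GL_FALSE);
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
drawScene();                      // full pixel shader here
[/code]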


Now something more interesting. When I changed the pixel shader to do much heavier computations, the timings shifted in favor of the first test case. I got something like:

first test case:
plane mesh - 9000ns

second test case:
plane mesh - 3000ns
grass mesh - 13000ns

So at this point I completely fail to understand this lack of consistency compared with the lighter pixel shader version.


And finally, one more thing that bothers me most. In the very first scenario I described in this thread, with the plane mesh filling the whole screen and taking 190ns to render, I noticed that the render time varied depending on how close the camera was to the plane, despite the fact that the plane was the only geometry filling the whole screen. The closer the camera was to the plane, the less time it took to render. My times varied from 180ns (as close to the plane as possible without near-plane clipping) to 220ns (as far away as possible while still fitting the entire plane on the screen).


I would be very grateful for any tips explaining those time "anomalies" :).

[quote name='maxest']And finally, one more thing that bothers me most. In the very first scenario I described in this thread, with the plane mesh filling the whole screen and taking 190ns to render, I noticed that the render time varied depending on how close the camera was to the plane, despite the fact that the plane was the only geometry filling the whole screen. The closer the camera was to the plane, the less time it took to render. My times varied from 180ns (as close to the plane as possible without near-plane clipping) to 220ns (as far away as possible while still fitting the entire plane on the screen).[/quote]


Why is that a surprise? You sound like you somehow moved from thinking that only the number of polygons matters to thinking that only the number of pixels matters. You get closer to the plane, more geometry gets clipped, fewer triangles get rasterized -> less overhead. You are measuring ns. There are about a gazillion factors that come into play, many of them hidden from you by things the driver might be doing or just the way stuff happens in the hardware.

How are you even measuring the exact timing for vertex and pixel processing when these things should be processed in a pipeline on the hardware? If you have more than one triangle, it's not first going to process all vertices and then do all the pixels. One triangle is done and gets rasterized/shaded while at the same time the next triangle gets prepared. There are caches, the second time around a vertex might not even be processed anymore. Hardware today has unified shaders that can be dynamically used for either vertex or pixel processing, completely changing how long vertex and pixel processing will take.

There is little to no point in making ns measurements and expecting any obvious, linear relationship between the numbers. Not on modern hardware.
I've just changed the 512-triangle plane to a 2-triangle plane and now the time only varies from 180 to 181ns, so you're right about that.

Any comments on the other issues? In particular, what about the scenario with 9000ns for rendering the plane versus 16000ns for rendering the plane and the grass? I thought that by using a heavy pixel shader I could emphasize the pixel processing workload and get very similar timings (by which I mean a small relative error).
Measuring timings on the GPU is fraught with peril. You need to make sure that you're actually measuring the full time taken to draw everything, rather than just checking the time at the beginning and end of your frames. If you do the latter, the only thing you're measuring is the time taken to submit API calls and add them to the command buffer - the actual drawing might not happen until the next frame or the frame after that.

You can flush the pipeline (e.g. with glFinish) to ensure that you get a full measurement, but then your measurement won't be valid for real-world usage.
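For example (a rough sketch; drawEverything() is a placeholder for the actual draw calls, and the CPU timer could be anything with sufficient resolution):

[code]
#include <chrono>

auto t0 = std::chrono::high_resolution_clock::now();

drawEverything();   // submit the draw calls
glFinish();         // stall until the GPU has actually executed them

auto t1 = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
// Without the glFinish() this would only measure command submission time;
// with it, the stall itself distorts the number, so neither figure matches
// real-world frame timing exactly.
[/code]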

There is no linear relationship between the amount of work done and the time taken to do it. GPUs like to receive data in large chunks, so a few large chunks are preferable to many small ones. It should be obvious from this that drawing 100,000 objects as a few large chunks can be much more efficient than drawing 1000 objects as many small chunks.
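As a simplified sketch of the difference (Object, drawIndividually and drawMerged are hypothetical names here, assuming the merged geometry has already been packed into a single VAO and index buffer):

[code]
#include <vector>

struct Object { GLuint vao; GLsizei indexCount; };  // hypothetical per-object data

// Many small chunks: one draw call per object - lots of per-call overhead.
void drawIndividually(const std::vector<Object>& objects)
{
    for (const Object& obj : objects)
    {
        glBindVertexArray(obj.vao);
        glDrawElements(GL_TRIANGLES, obj.indexCount, GL_UNSIGNED_INT, 0);
    }
}

// One large chunk: everything pre-merged into a single VAO/index buffer
// and submitted with a single call.
void drawMerged(GLuint mergedVao, GLsizei totalIndexCount)
{
    glBindVertexArray(mergedVao);
    glDrawElements(GL_TRIANGLES, totalIndexCount, GL_UNSIGNED_INT, 0);
}
[/code]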

Complexity of your shaders will also be a significant factor here.

Regarding overdraw, the term is effectively meaningless on all modern, and even most older, hardware. Your GPU is going to have some form of early-Z rejection, meaning that the depth test can run and reject pixels before the pixel shader runs. What this means is that even scenes where you think there should be heavy overdraw may in fact run surprisingly fast.

The only meaningful metric is total time for your frame, and the only meaningful test is to simplify your shaders and see what happens. That's not so bad. If simplifying your grass pixel shader shows a huge jump in performance, for example, then it should be quite obvious that this shader is a major bottleneck. But trying to get down to precise timings for every individual part of your renderer will run a high risk of confusing you and giving you invalid data.


[quote name='maxest']Now, since the amount of pixel work is exactly the same (in this scene I did a depth pre-pass, and the timings are from the actual shading pass with the depth test set to EQUAL)[/quote]
The pixel work isn't exactly the same -- the depth-test comes after the pixel shader, and the "early" depth-test comes before. The "early" version is usually approximate, and is allowed to let through false-positives (i.e. pixels can pass the early depth test, run the shader, then fail the depth test - no optimisation). Scenes with more complex depth won't benefit as much from the early depth test.

[quote][quote name='maxest' timestamp='1335054641' post='4933651']Now, since the amount of pixel work is exactly the same (in this scene I did a depth pre-pass, and the timings are from the actual shading pass with the depth test set to EQUAL)[/quote]
The pixel work isn't exactly the same -- the depth-test comes after the pixel shader, and the "early" depth-test comes before. The "early" version is usually approximate, and is allowed to let through false-positives (i.e. pixels can pass the early depth test, run the shader, then fail the depth test - no optimisation). Scenes with more complex depth won't benefit as much from the early depth test.[/quote]
Are you serious? The actual depth testing function set with glDepthFunc comes after the pixel shader? Then what would be the point of doing a depth pre-pass? Moreover, I did a test with a few "ground planes" laid one in front of another, and whether I had 5 planes or 1, the performance (FPS) was pretty much the same. So it is only this grass mesh causing so much trouble for the pipeline.

@mhagain: do you think I am measuring the render time with something like QPC :)? I am aware that that kind of thing only measures CPU time and is not relevant whatsoever to what the GPU does. For the timings I used http://www.opengl.org/registry/specs/ARB/timer_query.txt
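For reference, a minimal sketch of how that extension is typically used (drawMesh() is a placeholder for the measured draw calls):

[code]
// GL_ARB_timer_query / GL 3.3 core: GPU time in nanoseconds.
GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
drawMesh();                 // the draw call(s) being measured
glEndQuery(GL_TIME_ELAPSED);

// Read the result later (ideally a frame or two later, or after checking
// GL_QUERY_RESULT_AVAILABLE, so this call doesn't stall the pipeline).
GLuint64 elapsedNs = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);
[/code]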
[quote name='maxest']Are you serious? The actual depth testing function set with glDepthFunc comes after the pixel shader? Then what would be the point of doing a depth pre-pass?[/quote]
The depth test happens before and after the pixel shader. Technically, it has to behave as if it occurs after the pixel shader, but GPU manufacturers are allowed to optimise their cards so that it actually occurs before.

In practice, GPU manufacturers do the depth test both before and after. The early/before test is only an approximation -- you can think of it as a lower-resolution depth buffer, which catches e.g. 90% of 'overdrawn' pixels. The later/after test is the full accurate version, and catches the e.g. 10% that the early test missed.

What kind of GPU are you testing on?
Afaik NV GPUs use a sort of conservative hierarchical depth testing, but I thought it catches 100% of overdrawn pixels. I realize that my knowledge of GPU architecture is rather scarce, but I just can't understand why a depth test would be performed after the pixel shader, given that the pixel shader doesn't call discard. And as I said, a depth pre-pass for a couple of planes lying one on top of another works beautifully.

I use a laptop NV 540M.
It is performed afterwards because the pre-test isn't very fine-grained in its rejection.

Let's say you have a 4x4 grid of pixels; the pre-test might be performed on a 2x2 grid, with each "pixel" in the coarse test covering a quad of 4 pixels on the real screen. This means the z value stored in that 2x2 grid must be as conservative as possible in order not to get false rejections.

After the pixel shader, the 'real' z-test is performed at a more fine-grained level to catch the small number of pixels which pass the pre-test but not the post-test.

So, if your planes are all aligned you get 'perfect' early rejection, but as soon as things aren't perfectly aligned, some pixels that shouldn't make it to the output buffers will still pass the early test.
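Very roughly, you can picture the coarse pre-test as something like this (purely an illustrative model, not how any real GPU is implemented; coarseMaxDepth is a hypothetical buffer holding the farthest depth per 2x2 block, assuming a LESS depth function):

[code]
const int BLOCK = 2;  // coarse "pixels" cover 2x2 blocks of real pixels

// Returns true if the fragment can be rejected before shading.
// Conservative: reject only if the fragment is behind the farthest depth
// stored for the whole block, i.e. it would fail against *every* pixel in
// the block. Fragments that survive this may still fail the exact
// per-pixel test that runs after the pixel shader.
bool coarseZReject(const float* coarseMaxDepth, int coarseWidth,
                   int x, int y, float fragmentDepth)
{
    int blockIndex = (y / BLOCK) * coarseWidth + (x / BLOCK);
    return fragmentDepth >= coarseMaxDepth[blockIndex];
}
[/code]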
@phantom: what you're saying makes a lot of sense and answers the question of why rendering the grass with heavy pixel shaders takes so much time. It's obvious from the screenshot I provided that the grass mesh triangles are very "randomly" oriented (although the Y-coordinate of their normal vectors is always 0), so many of their pixels, even if not visible to the camera, might end up in the pixel shader.

One thing that doesn't fit here is the first scenario I described, with cheap pixel shaders, where the combined pixel processing time of the plane and the grass was lower than that of the plane alone. Although I "estimated" the vertex processing time of the grass simply by measuring the time spent rendering the grass while it was *not* visible on the screen and subtracting that, so this measurement may be biased to some extent. Still, interesting tips are welcome :).

