# Measuring performance - overdraw


## Recommended Posts

Hey guys,

I have some problems with measuring performance, mostly with understanding overdraw rendering times.

Let me first describe what I have in the scene. I have a plane mesh (composed of a number of triangles, something like 512). I'm pointing my camera down at this plane so that it occupies the whole screen. Rendering it takes 190ns, of which 1ns is vertex processing (I determined that by rendering the plane in such a way that the camera does not see it). So roughly speaking, pixel processing takes 189ns.

Now I added some "grass" mesh, which makes the whole scene look like in the screenshot grass.jpg (green is the "ground plane", where those fancy-colored polygons are part of the "grass" mesh; no alpha test or alpha blending, top-down view on the scene). Now, with the grass rendered, I got new timings:

plane mesh - 54ns (it now takes less space on the screen)
grass mesh - 370ns (260ns for vertex processing, so 110ns for pixel processing)

Now, since the amount of pixel work is exactly the same (in this scene I did a depth pre-pass, and the timings I made are from the actual shading pass with the depth test set to EQUAL), how is it that 189ns from the first test case (only the plane mesh filling the screen) is different from 54ns + 110ns = 164ns (only pixel processing work) in the second test case? The amount of pixel processing should be exactly the same in both cases, right? I could expect the second test case to actually be slower, because there are more pixels that need to be tested against the z-buffer.

Now something more interesting. When I changed the pixel shader to do much heavier computations, the timings shifted in favor of the first test case. I got something like:

first test case:
plane mesh - 9000ns

second test case:
plane mesh - 3000ns
grass mesh - 13000ns

So at this point I completely fail to understand this inconsistency with the results from the lighter pixel shader.

And finally, one more thing that bothers me most. In the very first scenario I described in this thread, with the plane mesh filling the whole screen and taking 190ns to render, I noticed that depending on how close the camera was to the plane, the render time varied, despite the fact that the plane was the only geometry entity filling the whole screen. I noticed that the closer the camera was to the plane, the less time it took to render it. My times varied from 180ns (as close to the plane as possible without frustum near plane clipping) to 220ns (as far as possible to still fit the entire plane on the screen).

I would be very grateful for any tips explaining those time "anomalies".

##### Share on other sites

> And finally, one more thing that bothers me most. In the very first scenario I described in this thread, with the plane mesh filling the whole screen and taking 190ns to render, I noticed that depending on how close the camera was to the plane, the render time varied, despite the fact that the plane was the only geometry entity filling the whole screen. I noticed that the closer the camera was to the plane, the less time it took to render it. My times varied from 180ns (as close to the plane as possible without frustum near plane clipping) to 220ns (as far as possible to still fit the entire plane on the screen).

Why is that a surprise? You sound like you somehow moved from thinking that only the number of polygons matters to thinking that only the number of pixels matters. You get closer to the plane, more geometry gets clipped, fewer triangles get rasterized -> less overhead. You are measuring ns. There are about a gazillion factors that come into play, many of them hidden from you by things the driver might be doing or just the way stuff happens in the hardware.

How are you even measuring the exact timing for vertex and pixel processing when these things should be processed in a pipeline on the hardware? If you have more than one triangle, it's not first going to process all vertices and then do all the pixels. One triangle is done and gets rasterized/shaded while at the same time the next triangle gets prepared. There are caches, the second time around a vertex might not even be processed anymore. Hardware today has unified shaders that can be dynamically used for either vertex or pixel processing, completely changing how long vertex and pixel processing will take.

There is little to no point to make ns measurements and expect any obvious and linear relationship between numbers. Not on modern hardware.

##### Share on other sites
I've just changed the 512-triangle plane to a 2-triangle plane, and now the time only varies from 180ns to 181ns, so you're right about that.

Any comments on the other issues? In particular, what about the scenario with 9000ns for rendering the plane alone, versus 16000ns for rendering the plane and the grass? I thought that by using a heavy pixel shader I could emphasize the pixel processing workload and get very similar timings (by this I mean a small relative error).

##### Share on other sites
Measuring timings on the GPU is fraught with peril. You need to make sure that you're actually measuring the full time taken to draw everything, rather than just checking time at the beginning and end of your frames. If doing the latter the only thing you're measuring is time taken to submit API calls and add them to the command buffer - the actual drawing might not happen until the next frame or the frame after that.

You can flush the pipeline (e.g. with glFinish) to ensure that you get a full measurement, but then your measurement won't be valid for real-world usage.

There is no linear relationship between the amount of work done and the time taken to do it. GPUs like to receive data in large chunks, so a few large chunks are preferable to many small ones. It should be obvious from this that drawing 100,000 objects as a few large chunks can be much more efficient than drawing 1,000 objects as many small chunks.

Regarding overdraw, the term is effectively meaningless on all modern, and even most older, hardware. Your GPU is going to have some form of early-Z rejection, meaning that the depth test can run and reject pixels before the pixel shader runs. What this means is that even scenes where you think there should be heavy overdraw may in fact run surprisingly fast.

The only meaningful metric is total time for your frame, and the only meaningful test is to simplify your shaders and see what happens. That's not so bad. If simplifying your grass pixel shader shows a huge jump in performance, for example, then it should be quite obvious that this shader is a major bottleneck. But trying to get down to precise timings for every individual part of your renderer will run a high risk of confusing you and giving you invalid data.

##### Share on other sites
> Now, since the amount of pixel work is exactly the same (in this scene I did depth pre-pass and the timings I made are from actual shading pass with depth test set to EQUAL)

The pixel work isn't exactly the same -- the depth test comes after the pixel shader, and the "early" depth test comes before. The "early" version is usually approximate, and is allowed to let through false positives (i.e. a pixel can pass the early depth test, run the shader, then fail the depth test, so no optimisation). Scenes with more complicated depth won't benefit as much from the early depth test.

##### Share on other sites

> Now, since the amount of pixel work is exactly the same (in this scene I did depth pre-pass and the timings I made are from actual shading pass with depth test set to EQUAL)
>
> The pixel work isn't exactly the same -- the depth test comes after the pixel shader, and the "early" depth test comes before. The "early" version is usually approximate, and is allowed to let through false positives (i.e. a pixel can pass the early depth test, run the shader, then fail the depth test, so no optimisation). Scenes with more complicated depth won't benefit as much from the early depth test.
Are you serious? The actual depth testing function set with glDepthFunc comes after the pixel shader? Then what would be the point of doing a depth pre-pass? Moreover, I did a test with a few "ground planes" laid one in front of another, and no matter how many planes I had, 5 or 1, the performance (FPS) was pretty much the same in both cases. So it is only this "grass mesh" causing so much trouble to the pipeline.

@mhagain: you think I am measuring the render time with stuff like QPC? I am aware that this kind of stuff only measures CPU time and is not relevant whatsoever to what the GPU does. For timings I used http://www.opengl.org/registry/specs/ARB/timer_query.txt

##### Share on other sites
> Are you serious? The actual depth testing function set with glDepthFunc comes after the pixel shader? Then what would be the point of doing a depth pre-pass?
The depth test happens before and after the pixel shader. Technically, it has to behave as if it occurs after the pixel shader, but GPU manufacturers are allowed to optimise their cards so that it actually occurs before.

In practice, GPU manufacturers do the depth test both before and after. The early/before test is only an approximation -- you can think of it as a lower resolution depth buffer, which catches e.g. 90% of 'overdrawn' pixels. The later/after test is the full accurate version, and catches the e.g. 10% that the early test missed.

What kind of GPU are you testing on?

##### Share on other sites
Afaik NV GPUs use a sort of conservative hierarchical depth testing, but I thought it catches 100% of overdrawn pixels. I realize that my knowledge regarding GPU architecture is rather scarce, but I just can't understand why a depth test would be performed after the pixel shader, given that the pixel shader doesn't call any discard. And as I said, a depth pre-pass for a couple of planes lying one on top of another works beautifully.

I use a laptop NV 540M.

##### Share on other sites
It is performed afterwards because the pre-test isn't very fine-grained in its rejection.

Let's say you have a 4x4 grid of pixels; the pre-test might be performed on a 2x2 grid, with each "pixel" in the coarse test covering a quad of 4 pixels of the real screen. This means the z value stored in that 2x2 grid must be as conservative as possible in order to not get false rejections.

Post pixel shader, the 'real' z-test is performed at a finer-grained level to catch the small number of pixels which pass the pre-test but not the post-test.

So, if your planes are all aligned you get 'perfect' early rejection, but as soon as things aren't perfectly aligned, some pixels will pass the early test which shouldn't make it to the output buffers.

##### Share on other sites
@phantom: what you're saying does make a lot of sense and answers the question of why rendering the grass with heavy pixel shaders takes so much time. It's obvious from the screenshot I provided that the grass mesh triangles are very "randomly" oriented (although the Y-coordinate of their normal vectors is always 0), so many of their pixels, even if not visible to the camera, might end up in the pixel shader.

One thing that doesn't fit here is the first scenario I described with cheap pixel shaders, where the combined pixel processing time of the plane and grass (164ns) was lower than that of the plane alone (189ns). Although I "estimated" the vertex processing time of the grass simply by subtracting the time spent on rendering the grass while *not* visible on the screen, so I suppose this measurement can be biased to some extent. Still, interesting tips are welcome.

##### Share on other sites
I think your timing method is suspect; the only real way to find out what is going on is to spin up a performance analysis tool and take a look at it with that.

##### Share on other sites
My timing is purely based on http://www.opengl.org/registry/specs/ARB/timer_query.txt. I start profiling right before calling glDrawElements and stop right after. Here's a class I quickly wrote:

```cpp
class CTimer
{
public:
    GLuint queries[1];
    GLint available;
    GLuint64 timeElapsed;

    void start()
    {
        available = 0;
        timeElapsed = 0;
        // Create a query object and start timing.
        glGenQueries(1, &queries[0]);
        glBeginQuery(GL_TIME_ELAPSED, queries[0]);
    }

    void stop()
    {
        glEndQuery(GL_TIME_ELAPSED);

        // Busy-wait until the result is available. Note that this stalls
        // the CPU until the GPU has finished the timed commands.
        while (!available)
            glGetQueryObjectiv(queries[0], GL_QUERY_RESULT_AVAILABLE, &available);

        // The elapsed time is reported in nanoseconds; care should be taken
        // to use all 64 significant bits of the result, not just the least
        // significant 32.
        glGetQueryObjectui64v(queries[0], GL_QUERY_RESULT, &timeElapsed);
        glDeleteQueries(1, &queries[0]);

        // Print the elapsed time in microseconds.
        printf("%llu\n", (unsigned long long)(timeElapsed / 1000));
    }
};
```

I would happily use some profiling tool if I could finally manage to. I've had a couple of go-rounds with NVPerfHUD but never managed to get it working...

##### Share on other sites
Just to clarify the "why" of early/late Z rejection: Because there are some situations when depth test must be done after the pixel shader (shaders that output depth, in particular) - the hardware needs to be able to do the full depth test logic after that stage. Once you've made that concession, including all of the additional logic in the hardware to also be able to do it before (and switch dynamically, based on things like presence of clip instructions) isn't considered worth it. Instead, they just include the coarse early rejection hardware. Every piece of functionality costs die space - and alternate data paths like that are particularly nasty from a chip complexity standpoint. The GPU is much easier to design and optimize if it's a single long pipeline. (Which obviously isn't entirely true, but for the parts that remain fixed-function, like depth test, it's still a good tradeoff to make).

##### Share on other sites
Yeah... I guess it's much, much more complex down there at the hardware level than when you write your own software rasterizer running on the CPU.

##### Share on other sites

> Just to clarify the "why" of early/late Z rejection: Because there are some situations when depth test must be done after the pixel shader (shaders that output depth, in particular) - the hardware needs to be able to do the full depth test logic after that stage. Once you've made that concession, including all of the additional logic in the hardware to also be able to do it before (and switch dynamically, based on things like presence of clip instructions) isn't considered worth it. Instead, they just include the coarse early rejection hardware.

Except that isn't true, since modern GPUs *do* have the hardware to perform full fine-grained depth/stencil testing before execution of the pixel shader. They still have coarse-grained z/stencil rejection, since it is cheaper to reject entire tiles than it is to perform a full z-test per-pixel.

##### Share on other sites
One thing I have learned is never to assume anything with modern hardware; there is just way too much going on. The only way to be sure is to capture a frame and time it through the GPU (even this has limited meaning; you need to do captures from many camera locations). At least in the world of consoles you learn so much about the hardware from this. I don't know how much fine-grained data you get from a PC graphics card these days, and at the end of the day you're probably not going to be trying to squeeze out the absolute last drop of performance anyway. You can only really do your best to find mistakes in what you are sending to be rendered; there is no point trying to start micro-optimizing until you have at least something that represents the final scene you are rendering.

##### Share on other sites
Well, the whole thing started when I decided to add grass rendering to my game, which is a top-down game. I noticed that the game's FPS dropped from 180 (with only the ground plane visible) to 60 FPS (alpha blending; ground plane visible with grass on top of it covering the entire screen) or 80 FPS (alpha test). I could have guessed why alpha blending is such a killer here, but the time needed for the alpha-tested version surprised me; I expected the drop to be less dramatic. Then I turned off alpha testing (and alpha blending): 113 FPS. This difference, 180 vs. 113, led me to more accurate profiling, because in that case I expected very similar results. Now I know it's more complex than that, and that a depth pre-pass doesn't guarantee 100% early depth rejection.

##### Share on other sites

> Well, the whole thing started when I decided to add grass rendering to my game, which is a top-down game. I noticed that the game's FPS dropped from 180 (with only the ground plane visible) to 60 FPS (alpha blending; ground plane visible with grass on top of it covering the entire screen) or 80 FPS (alpha test). I could have guessed why alpha blending is such a killer here, but the time needed for the alpha-tested version surprised me; I expected the drop to be less dramatic. Then I turned off alpha testing (and alpha blending): 113 FPS. This difference, 180 vs. 113, led me to more accurate profiling, because in that case I expected very similar results. Now I know it's more complex than that, and that a depth pre-pass doesn't guarantee 100% early depth rejection.

Sometimes it's not going to be worth doing a prepass; if there are a lot of batches it could have a negative effect on CPU performance. This depends on your target platforms (on modern hardware and APIs it's less of an issue). Also, you pay something in rasterization operations even when rendering fast Z for your prepass. If you instead draw as much as you can in the correct order with no prepass, with some luck you might have better performance. You might also want to look at your grass asset and make sure it is optimal, as in make sure there is no wasted zero-alpha space at the top of the quad.

Top-down games are nice to work on in terms of rendering performance.

##### Share on other sites

> Top-down games are nice to work on in terms of rendering performance.

Indeed.

I would argue that a Z-prepass is *mostly* a win, assuming we have enough CPU to spare and a lot of pixel processing. For instance, I use forward rendering and have quite a few lights in the scene. Because I have a depth pre-pass, for alpha-tested geometry I do the alpha test and discard only *once*, in the depth pre-pass. None of the lighting shaders have to do it, because the depth buffer is already filled correctly.