Measuring performance - overdraw

17 comments, last by maxest 12 years ago
I think your timing method is suspect; the only real way to find out what is going on is to spin up a performance analysis tool and take a look at it with that.
My timing is purely based on http://www.opengl.org/registry/specs/ARB/timer_query.txt. I start profiling before calling glDrawElements and stop right after. Here's a class I quickly wrote:


#include <cstdio> // printf; the GL headers/loader are assumed to be included elsewhere

class CTimer
{
public:
    GLuint query;
    GLint available;
    GLuint64 timeElapsed;

public:
    CTimer()
    {
        // Create the query object once; generating a new one in start()
        // every frame would leak query objects.
        glGenQueries(1, &query);
    }

    ~CTimer()
    {
        glDeleteQueries(1, &query);
    }

    void start()
    {
        available = 0;
        timeElapsed = 0;

        // Begin the time-elapsed query
        glBeginQuery(GL_TIME_ELAPSED, query);
    }

    void stop()
    {
        glEndQuery(GL_TIME_ELAPSED);

        // Wait for the result to become available
        while (!available)
            glGetQueryObjectiv(query, GL_QUERY_RESULT_AVAILABLE, &available);

        // See how much time the rendering took, in nanoseconds. Note that
        // care should be taken to use all significant bits of the result,
        // not just the least significant 32 bits.
        glGetQueryObjectui64v(query, GL_QUERY_RESULT, &timeElapsed);

        // Print the result in microseconds. %lu can truncate a GLuint64 on
        // 32-bit builds, so cast explicitly.
        printf("%llu\n", (unsigned long long)(timeElapsed / 1000));
    }
};
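For completeness, this is roughly how I wrap a single draw call with it (just a sketch; renderGrass and indexCount are placeholders for the real grass batch):

void renderGrass()
{
    // Constructed on first use, once the GL context already exists.
    static CTimer timer;

    // ... bind shader, textures and buffers for the grass batch ...

    timer.start();
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);
    timer.stop(); // busy-waits for the GPU result and prints it in microseconds
}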


I would happily use a profiling tool if I could finally get one working. I've had a couple of go-rounds with NVPerfHUD but never managed to get it running...
Just to clarify the "why" of early/late Z rejection: Because there are some situations when depth test must be done after the pixel shader (shaders that output depth, in particular) - the hardware needs to be able to do the full depth test logic after that stage. Once you've made that concession, including all of the additional logic in the hardware to also be able to do it before (and switch dynamically, based on things like presence of clip instructions) isn't considered worth it. Instead, they just include the coarse early rejection hardware. Every piece of functionality costs die space - and alternate data paths like that are particularly nasty from a chip complexity standpoint. The GPU is much easier to design and optimize if it's a single long pipeline. (Which obviously isn't entirely true, but for the parts that remain fixed-function, like depth test, it's still a good tradeoff to make).
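To make the first point concrete, here is a minimal sketch (shader sources as C++ string literals, names made up) of the two cases: a fragment shader that leaves depth alone and stays eligible for early Z, and one that writes gl_FragDepth, so the final depth isn't known until after shading and the full test has to run late:

// Doesn't touch depth: the rasterizer's interpolated depth is final,
// so the hardware is free to do the depth test before shading.
const char* plainFS =
    "#version 330\n"
    "uniform sampler2D tex;\n"
    "in vec2 uv;\n"
    "out vec4 color;\n"
    "void main() { color = texture(tex, uv); }\n";

// Writes gl_FragDepth: the real depth only exists after the shader has
// run, so the full depth test is forced to happen after shading.
const char* depthWritingFS =
    "#version 330\n"
    "uniform sampler2D tex;\n"
    "in vec2 uv;\n"
    "out vec4 color;\n"
    "void main()\n"
    "{\n"
    "    color = texture(tex, uv);\n"
    "    gl_FragDepth = gl_FragCoord.z + 0.001; // any write to depth disqualifies early Z\n"
    "}\n";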
Yeah... I guess it's much, much more complex down there at the hardware level than it is when you write your own software rasterizer running on the CPU :)

Just to clarify the "why" of early/late Z rejection: Because there are some situations when depth test must be done after the pixel shader (shaders that output depth, in particular) - the hardware needs to be able to do the full depth test logic after that stage. Once you've made that concession, including all of the additional logic in the hardware to also be able to do it before (and switch dynamically, based on things like presence of clip instructions) isn't considered worth it. Instead, they just include the coarse early rejection hardware.


Except that isn't true, since modern GPUs *do* have the hardware to perform full fine-grained depth/stencil testing before execution of the pixel shader. They still have coarse-grained z/stencil rejection, since it is cheaper to reject entire tiles than it is to perform a full z-test per-pixel.
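For what it's worth, on GL 4.2-class hardware you can even request the fine-grained test explicitly from the shader side. A sketch in the same string-literal style as above; note that with this qualifier any gl_FragDepth write is ignored and depth gets written even for fragments that later discard, so it isn't appropriate for every shader:

// Requests that depth/stencil testing run before this shader executes
// (early_fragment_tests layout qualifier, GLSL 4.20+).
const char* forcedEarlyZFS =
    "#version 420\n"
    "layout(early_fragment_tests) in;\n"
    "uniform sampler2D tex;\n"
    "in vec2 uv;\n"
    "out vec4 color;\n"
    "void main() { color = texture(tex, uv); }\n";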
One thing I have learned is never to assume anything with modern hardware; there is just way too much going on. The only way to be sure is to capture a frame and time it through the GPU (and even that has limited meaning - you need to do captures from many camera locations). At least in the world of consoles you learn a lot about the hardware this way; I don't know how much fine-grained data you get from a PC graphics card these days, and at the end of the day you're probably not going to be trying to squeeze out the absolute last drop of performance anyway. You can only really do your best to find mistakes in what you are sending to be rendered; there is no point trying to start micro-optimizing until you have at least something that represents the final scene you are rendering.
Well, the whole thing started when I decided to add grass rendering to my game, which is a top-down game. I noticed that the game's FPS dropped from 180 (only the ground plane visible) to 60 FPS (alpha blending; ground plane visible with grass on top of it covering the entire screen) or 80 FPS (alpha test). I could have guessed why alpha blending is such a killer here, but the time needed for the alpha-tested version surprised me; I expected the drop to be less dramatic. Then I turned off alpha testing (and alpha blending) - 113 FPS. That difference, 180 vs 113, led me to more accurate profiling, because in that case I expected very similar results. Now I know it's more complex than that, and that a depth pre-pass doesn't guarantee 100% early depth rejection :).
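By the way, those numbers are easier to reason about as frame times than as FPS; a trivial conversion sketch:

#include <cstdio>

int main()
{
    // The FPS figures above, converted to milliseconds per frame (1000 / FPS).
    const float fps[]     = { 180.0f, 113.0f, 80.0f, 60.0f };
    const char* variant[] = { "ground only", "grass, no alpha test/blend",
                              "alpha test", "alpha blending" };

    for (int i = 0; i < 4; i++)
        printf("%-28s %5.2f ms/frame\n", variant[i], 1000.0f / fps[i]);

    // Prints ~5.56, 8.85, 12.50 and 16.67 ms: the grass geometry itself
    // costs ~3.3 ms, alpha test adds ~3.7 ms, and blending ~4.2 ms on top.
    return 0;
}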

Well, the whole thing started when I decided to add grass rendering to my game, which is a top-down game. I noticed that the game's FPS dropped from 180 (only the ground plane visible) to 60 FPS (alpha blending; ground plane visible with grass on top of it covering the entire screen) or 80 FPS (alpha test). I could have guessed why alpha blending is such a killer here, but the time needed for the alpha-tested version surprised me; I expected the drop to be less dramatic. Then I turned off alpha testing (and alpha blending) - 113 FPS. That difference, 180 vs 113, led me to more accurate profiling, because in that case I expected very similar results. Now I know it's more complex than that, and that a depth pre-pass doesn't guarantee 100% early depth rejection :).


Sometimes it's not going to be worth doing a prepass; if there are a lot of batches it could have a negative effect on CPU performance - this depends on your target platforms (with modern hardware and APIs it's less of an issue). Also, you pay something in rasterisation even when rendering fast Z for your prepass. If you instead draw as much as you can in the correct order with no prepass, with some luck you might get better performance. You might also want to look at your grass asset and make sure it is optimal - i.e. make sure there is no wasted zero-alpha area at the top of the quad.

Top-down games are nice to work on in terms of rendering performance.

Top-down games are nice to work on in terms of rendering performance.

Indeed :).

I would argue that a Z-prepass is *mostly* a win, assuming we have enough CPU to spare and a lot of pixel processing. For instance, I use forward rendering and have quite a few lights in the scene. Because I have the depth pre-pass, for alpha-tested geometry I do the alpha test and discard only *once*, in the depth pre-pass. None of the lighting shaders have to do it, because the depth buffer is already filled correctly.
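In GL terms the idea is roughly this (just a sketch; the drawScene* calls, lights array and lightCount are placeholders for my actual scene code):

void renderWithPrepass()
{
    // Pass 1: depth pre-pass. Color writes off; the alpha-tested shader
    // samples the grass texture and discards, so only opaque texels
    // end up writing depth.
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LEQUAL);
    drawSceneDepthOnlyWithAlphaTest();

    // Passes 2..N: lighting. The depth buffer is already exact, so the
    // lighting shaders skip the alpha test entirely; GL_EQUAL makes sure
    // only the pixels that survived the pre-pass get shaded.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);
    glDepthFunc(GL_EQUAL);
    for (int i = 0; i < lightCount; i++)
        drawSceneWithLight(lights[i]);
}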

