Fast Way To Determine If All Pixels In OpenGL Depth Buffer Were Drawn At Least Once?

18 comments, last by Macin Software 7 years, 6 months ago

Hello, I am programming an FPS game and I simply want to make it faster. I tried a lot of things, of which only a few worked. My testing map has 15k vertices and 26k triangles, and I partition space into X*Y*Z orthogonal cubes. That's fine because it works. The next thing that helped a lot was to sort the partitions by distance from the partition I am in and draw them nearest first, which made OpenGL overdraw much less. I also use face culling, my own per-triangle frustum clipping, and an overdraw check that makes sure the same triangle is rendered only once. Also, before loading a map, the triangle lists of each partition are sorted by texture to minimize switching between glEnd and glBegin, which was causing a big slowdown. I also tried arranging triangles into triangle strips, but they suck arse, and I tried using VBOs instead of glBegin and glEnd, but it didn't help as much as the internet promised. My next, not yet implemented idea is to recompute all this stuff only when the player's position or rotation changes, and otherwise just render the same thing as before without any computation.

Anyway, every tick I count the number of triangles actually being drawn. As a testing map I use Hell Gate from Quake III (the one with the crazy mouth in a room, if somebody knows it :D ) loaded from an exported .obj file. When I am at the end of the map and looking outside (into the mouth), 36 triangles are drawn, which seems fair to me. However, turning by 180 degrees makes me look INSIDE the map, my frustum then contains nearly all partitions, and 22k of the 26k triangles are displayed. Now I get to my idea - display a few partitions, then CHECK IF ALL PIXELS WERE DISPLAYED; if not, display some more partitions, and so on. That could make it really fast, because it would cut off everything except the first room. The problem is that extracting the depth buffer and checking all 1920x1080 of those little guys is so slow that it is counterproductive (proven by trying it).

So my question is - is there actually a FAST way to check whether all pixels have been rendered at least once (= whether the depth value is not 127 anywhere)? I did about 3 hours of research which didn't turn up an answer. Also, if people here say "no", it will encourage me to write my own rasterization (at least I will have full control).


Also, for illustration go to 2:17 :

There's a hardware feature called 'occlusion queries', which does exactly what you're looking for -- it gives a yes/no answer to whether something was drawn or not. To find out if there are "holes" in the depth buffer, you can draw a quad that's very far away using an occlusion query, and check if the result is "yes - the quad was visible".

Now I get to my idea - display a few partitions, then CHECK IF ALL PIXELS WERE DISPLAYED; if not, display some more partitions, and so on. That could make it really fast, because it would cut off everything except the first room. The problem is that extracting the depth buffer and checking all 1920x1080 of those little guys is so slow that it is counterproductive (proven by trying it).

A bigger problem is that the CPU and GPU have a very large latency between them. When you call any glDraw function, the driver is actually writing a command packet into a queue (like networking!), and the GPU might not execute that command until, say, 30ms later. This is perfectly fine in most situations, as the CPU and GPU form a pipeline with huge throughput, but long latency.
e.g. a healthy timeline looks like:


CPU: | Frame 1 | Frame 2 | Frame 3 | ...
GPU: | wait    | Frame 1 | Frame 2 | ...


If you ever try to read GPU data back to the CPU during a frame -- e.g. you split your frame into two parts (A/B) with a read-back operation in between them, you end up with a timeline like this:


CPU: | Frame1.A | wait     |Copy| Frame1.B | Frame2.A | wait     |Copy| Frame2.B | Frame3.A | ...
GPU: | wait     | Frame1.A |Copy| wait     | Frame1.B | Frame2.A |Copy| wait     | Frame2.B | ...

Now, both the CPU and GPU spend roughly half of the time idle, waiting on the other processor.
If you're going to read back GPU data, you need to wait at least one frame before requesting the results, to avoid causing a pipeline bubble :(
That means that reading back GPU data to use in CPU-driven occlusion culling is a dead-end for performance.
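
To put rough numbers on the two timelines above, here is a small back-of-the-envelope model. It is only a sketch: the FrameCost struct and both function names are made up for illustration, and it assumes the CPU and GPU costs per frame are constant, which real frames are not.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-frame costs in milliseconds (illustration only).
struct FrameCost {
    double cpuMs; // time the CPU needs to build one frame's commands
    double gpuMs; // time the GPU needs to execute them
};

// Healthy pipeline: the CPU builds frame N+1 while the GPU draws frame N,
// so the slower of the two processors sets the pace.
double pipelinedFrameMs(const FrameCost& c) {
    return std::max(c.cpuMs, c.gpuMs);
}

// Same-frame read-back: each processor sits idle while the other works,
// so the costs add up, plus the time of the copy itself.
double serializedFrameMs(const FrameCost& c, double copyMs) {
    assert(copyMs >= 0.0);
    return c.cpuMs + c.gpuMs + copyMs;
}
```

With 10 ms of CPU work and 10 ms of GPU work, the pipelined case comes out at 10 ms per frame and the read-back case at 20 ms or more, which is the "roughly half of the time idle" situation.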

Writing your own rasterizer isn't really going to solve your problem, since you won't be utilizing your GPU at all (or if you use compute, not as efficiently as you could be). Just leave that stuff to the GPU guys, they know what they're doing. :)

Anyway, do you really need such precise culling? I mean, are you absolutely sure you're GPU bound? Going into such detail just to cull a few triangles might not be worth it, and could hurt your performance rather than help if you're actually CPU bound since modern GPUs prefer to eat big chunks of data more than they like to issue a draw call for each individual triangle. If you have bounding box culling on your objects, and frustum culling, then I think that's all you'll really need unless you're writing a big AAA title with a very high scene complexity.

Just bear in mind that Quake levels were built for some different hardware constraints, so you should probably break up the obj model you have into small sections to avoid processing the entire mesh in one chunk so that you can leverage those two culling systems a little more.

That said, if you really want to have some proper occlusion culling for triangles, you can either check out the Frostbite approach (it's quite complicated iirc), or try implementing a simple Hi-Z culling system using Geometry Shaders (build a simple quad-tree out of your zbuffer and do quad-based culling on each triangle using the geometry shader). The latter is simpler to implement and I've had pretty good results with it.
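
To make the Hi-Z idea a bit more concrete, here is a CPU-side sketch of the quad-tree and the test against it. This is not the geometry shader version described above: it runs on the CPU purely to show the logic, it assumes a square power-of-two depth buffer where larger values are farther away, and all names (DepthLevel, buildHiZ, isOccluded) are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One level of the depth pyramid: size x size texels, each holding the
// FARTHEST depth (largest value) found in the region it covers.
struct DepthLevel {
    std::size_t size;
    std::vector<float> depth;
    float at(std::size_t x, std::size_t y) const { return depth[y * size + x]; }
};

// Build the pyramid from a square, power-of-two depth buffer (level 0).
std::vector<DepthLevel> buildHiZ(const std::vector<float>& depth, std::size_t size) {
    assert(size > 0 && (size & (size - 1)) == 0 && depth.size() == size * size);
    std::vector<DepthLevel> levels;
    levels.push_back(DepthLevel{size, depth});
    while (levels.back().size > 1) {
        const DepthLevel& src = levels.back(); // only used before the push_back below
        const std::size_t half = src.size / 2;
        DepthLevel dst{half, std::vector<float>(half * half)};
        for (std::size_t y = 0; y < half; ++y)
            for (std::size_t x = 0; x < half; ++x)
                dst.depth[y * half + x] = std::max({src.at(2 * x, 2 * y), src.at(2 * x + 1, 2 * y),
                                                    src.at(2 * x, 2 * y + 1), src.at(2 * x + 1, 2 * y + 1)});
        levels.push_back(std::move(dst));
    }
    return levels;
}

// True if the rectangle [x0,x1] x [y0,y1] (inclusive, in level-0 texels), whose
// NEAREST depth is nearestZ, lies behind everything already in the depth buffer.
// The test is conservative: it may answer "visible" for something that is hidden.
bool isOccluded(const std::vector<DepthLevel>& hiz, std::size_t x0, std::size_t y0,
                std::size_t x1, std::size_t y1, float nearestZ) {
    assert(!hiz.empty() && x0 <= x1 && y0 <= y1 && x1 < hiz[0].size && y1 < hiz[0].size);
    // Go to coarser levels until the rectangle covers at most 2x2 texels.
    std::size_t level = 0;
    while (level + 1 < hiz.size() &&
           ((x1 >> level) - (x0 >> level) > 1 || (y1 >> level) - (y0 >> level) > 1))
        ++level;
    const DepthLevel& l = hiz[level];
    for (std::size_t y = y0 >> level; y <= (y1 >> level); ++y)
        for (std::size_t x = x0 >> level; x <= (x1 >> level); ++x)
            if (nearestZ <= l.at(x, y)) return false; // could be in front of something here
    return true;
}
```

Note that a pixel that was never drawn still holds the clear depth (the far plane), so any region containing a hole reports "not occluded", which is the behaviour the original question was after.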

So I tried occlusion culling. In principle it works, but guess what. :D

It made it slower. Initially I got a framerate of 42. When I test the gl_samples count every 20 partitions = 38 fps. Every 10 partitions = 32 fps :(

Keeping a list of the triangles displayed in the previous frame helped when I don't move - I get 41, but when I move, it reliably drops to 32.

And yes, I think it is worth it, because I see something like 5k triangles max. When I have to pass 15k more triangles to OpenGL than that, it is a very noobish performance leak.

Anyway thank you guys.

"Thousands of triangles" should not be something that alters the framerate so dramatically. A modern game can draw hundreds of thousands of triangles at 1000fps -- which is one of the reasons that VBO's replaced begin/end (same number of gl function calls required for any number of triangles).

You need to profile your game to find out where the time is being spent. Set up a class that reads the high-frequency timer at two points in time, computes the difference, and logs the result, and then put instances of this class in any function that you think might be a performance hog.

It's common to do this with a constructor/destructor:

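// Provided elsewhere by your profiler: begin/end a named timing scope.
void PushProfileScope(const char* name);
void PopProfileScope();

struct ProfileLogger
{
  ProfileLogger(const char* name) { PushProfileScope(name); }
  ~ProfileLogger() { PopProfileScope(); }
};
#define Profile(name) ProfileLogger _profile_(name)

void Test()
{
  Profile("Test"); // calls PushProfileScope("Test") here
  DoStuff();
} // _profile_ is destroyed here, which calls PopProfileScope()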

From this data you can get a hierarchical breakdown of where all your CPU time is spent per frame. Trying to optimize without this data is just shooting in the dark.
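
One possible way to fill in PushProfileScope and PopProfileScope so that you get that hierarchical breakdown is sketched below. Only the two function names come from the snippet above; the globals, ProfileRecord, and DumpProfile are assumptions made up for this example, and it is single-threaded only.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

using ProfileClock = std::chrono::steady_clock;

// One finished (or still running) scope: its name, nesting depth and duration.
struct ProfileRecord {
    std::string name;
    int depth;
    double ms;
};

// A scope that has been pushed but not popped yet.
struct OpenScope {
    ProfileClock::time_point start;
    std::size_t recordIndex;
};

static std::vector<OpenScope> g_openScopes;   // stack of running scopes
static std::vector<ProfileRecord> g_records;  // results for the current frame

void PushProfileScope(const char* name) {
    // Reserve the record now so that parents are listed before their children.
    g_records.push_back(ProfileRecord{name, static_cast<int>(g_openScopes.size()), 0.0});
    g_openScopes.push_back(OpenScope{ProfileClock::now(), g_records.size() - 1});
}

void PopProfileScope() {
    assert(!g_openScopes.empty());
    const OpenScope scope = g_openScopes.back();
    g_openScopes.pop_back();
    const std::chrono::duration<double, std::milli> elapsed = ProfileClock::now() - scope.start;
    g_records[scope.recordIndex].ms = elapsed.count();
}

// Print the breakdown, indented by nesting depth, then reset for the next frame.
void DumpProfile() {
    for (const ProfileRecord& r : g_records)
        std::printf("%*s%s: %.3f ms\n", r.depth * 2, "", r.name.c_str(), r.ms);
    g_records.clear();
}
```

Calling DumpProfile() once at the end of each frame (or every N frames) prints an indented tree, so a slow function shows up under the parent that called it.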

From the sounds of it, your game is almost certainly CPU-bound, so you can start here. Later on though, you can use gl timer queries to do the same thing on the GPU side -- wrapping parts of the scene in two timer queries to find out how long it took the GPU to process those commands.

Displaying is the actual bottleneck, because I have done this stuff before. What I didn't know was that on the testing computer the Nvidia settings were set to the best antialiasing and texture filtering, so I turned them off and now it runs at a stable 60fps without any visible change for the worse :)

... anyway, if the occlusion culling queries are so slow, what is their point then? Will for example GL_ANY_SAMPLES_PASSED_CONSERVATIVE speed it up? Another idea is lowering viewport resolution when doing queries and then setting it back to normal ...

The problem with this use of occlusion queries is the pipeline bubble caused by reading back results on the same frame. Even if the query itself is free, this bubble will halve your framerate.

Reading back an occlusion query is fine if you wait one frame before requesting the result, as this won't disrupt the pipeline. This is useful for things like lens flares or special effects where you don't care about the data being one frame old, but is dangerous for deciding what parts of the scene to draw :(

Generally they're a pretty useless API feature...

In a modern engine though, you can move all your culling and "what to draw" code off the CPU and into a compute shader. You can then issue draw-indirect commands to say "you will be drawing something, but I don't know which triangles, yet. The number and offset will be present in this buffer later" (which is filled in by the compute shader).
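
As a rough picture of what ends up in that buffer, here is a CPU-side sketch. The struct layout follows the DrawElementsIndirectCommand layout from the OpenGL specification; everything else (Chunk, buildIndirectCommands) is hypothetical, and in the real setup the loop would be a compute shader writing into a GPU buffer that is then consumed by something like glMultiDrawElementsIndirect, with no CPU involvement.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Field order as defined by OpenGL for indexed indirect draws.
struct DrawElementsIndirectCommand {
    std::uint32_t count;         // number of indices to draw
    std::uint32_t instanceCount; // 1 for a plain, non-instanced draw
    std::uint32_t firstIndex;    // offset into the index buffer, in indices
    std::int32_t  baseVertex;    // value added to every index
    std::uint32_t baseInstance;
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20, "must be tightly packed");

// Hypothetical piece of the level: a range in one big static index buffer,
// plus the result of whatever culling test was run on it.
struct Chunk {
    std::uint32_t firstIndex;
    std::uint32_t indexCount;
    bool visible;
};

// Emit one draw command per visible chunk (the part a compute shader would do).
std::vector<DrawElementsIndirectCommand> buildIndirectCommands(const std::vector<Chunk>& chunks) {
    std::vector<DrawElementsIndirectCommand> commands;
    for (const Chunk& c : chunks) {
        if (!c.visible) continue;
        commands.push_back(DrawElementsIndirectCommand{c.indexCount, 1u, c.firstIndex, 0, 0u});
    }
    return commands;
}
```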

For something like a quake3 level though, you should be able to have a constant draw cost regardless of how many triangles are visible. The level triangle data is static, so put it in a VBO once and never update it again.

That Hi-Z culling looks promising, I will keep it as a backup idea for optimization. I don't doubt that the lighting model will fuck up the framerate significantly once I add it, so there will for sure be a need for any optimization that works. But since I already have 60fps now and people around me mainly demand the basic functionality ("How's it going with the game?" "I made it faster" "And can I play it?" "Not yet"), I must move on to networking and multiplayer ASAP.

By the way, I don't think that a constant draw cost is good. I could measure the time of the loop and, if some time remains, do some filler work like pre-loading a chunk of the next map into a second map buffer or something :)

Since you're using a Quake 3 map, the classic Quake method of solving this was to precompute a potentially visible set (or PVS) using an offline pre-processor, then do checks against that PVS at runtime to determine what should (or should not) be drawn.

In (very basic) outline, you divide your map into what I'll call "areas"; these could be nodes/leafs in a BSP tree (which was what Quake used), rooms, cubes in a grid, whatever. Then for each such area, you use some brute-force method to determine what other areas are potentially visible from it (I believe that the "potentially" part is on account of some coarseness in the algorithm, as well as the fact that this stage ignores frustum culling which is still done at runtime). Store out the result in some fast and compact data format (Quake used a bitfield array). Then at runtime you're just looking up those stored results; draw calls, overdraw, etc. all go down, the map runs faster, and everybody is happy.
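
The runtime half of that can be very small. Below is a minimal sketch assuming an uncompressed bitfield with one row per area; the Pvs struct and areasToDraw are names invented here, not Quake's actual format, and the offline step that decides which bits to set is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Precomputed visibility: bit (from, to) is set if area `to` is potentially
// visible from area `from`. One row of bits per area.
struct Pvs {
    std::size_t areaCount;
    std::vector<std::uint8_t> bits;

    explicit Pvs(std::size_t n) : areaCount(n), bits(n * ((n + 7) / 8), 0) {}

    std::size_t rowBytes() const { return (areaCount + 7) / 8; }

    // Called by the offline pre-processor.
    void setVisible(std::size_t from, std::size_t to) {
        assert(from < areaCount && to < areaCount);
        bits[from * rowBytes() + to / 8] |= static_cast<std::uint8_t>(1u << (to % 8));
    }

    // Called at runtime: a single memory lookup.
    bool isVisible(std::size_t from, std::size_t to) const {
        assert(from < areaCount && to < areaCount);
        return ((bits[from * rowBytes() + to / 8] >> (to % 8)) & 1u) != 0;
    }
};

// Areas worth submitting for the area the camera is in. Frustum culling would
// still be applied to this list afterwards.
std::vector<std::size_t> areasToDraw(const Pvs& pvs, std::size_t cameraArea) {
    std::vector<std::size_t> result;
    for (std::size_t area = 0; area < pvs.areaCount; ++area)
        if (pvs.isVisible(cameraArea, area)) result.push_back(area);
    return result;
}
```

With the cube partitions from the original post, each cube would be one area, and the per-frame work becomes finding the camera's cube and walking one row of bits.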

The downside is that the pre-processing can take time, needs to be re-run even if you make trivial changes to your map, and needs a custom map format to store the data. And while we're on the subject of formats, .obj is a horrible, horrible, horrible, horrible, horrible format to use for game maps. The only reason to use it is if you really love writing text parsers. The ideal format is where you memory-map a file, read some headers to set up some sizes, then glBufferData the rest. Simple, quick to load, no faffing about. And while we're on the subject of glBufferData, if your observation is that VBOs are slower than glBegin/glEnd, then you're using them wrong: probably by writing a glBegin/glEnd-alike wrapper around the VBO API.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.
