Fast Way To Determine If All Pixels In Opengl Depth Buffer Were Drawn At Least Once?

19 replies to this topic

#1   Members   


Posted 24 July 2016 - 01:53 PM

Hello, I am programming an FPS game and I simply want to make it faster. I tried a lot of things, of which only a few worked. My testing map has 15k vertices and 26k triangles, and I partition space into X*Y*Z axis-aligned cubes. That's fine because it works. The next thing that helped a lot was to order the partitions by distance from the partition I am in and draw them nearest first, which reduced OpenGL overdraw. I also use face culling, my own per-triangle frustum clipping, and an overdraw check that makes sure the same triangle is rendered only once. Before loading a map, the triangle lists of each partition are sorted by texture to minimize the need to switch between glEnd and glBegin, which caused a big slowdown. I also tried arranging triangles into triangle strips, but they suck arse, and I tried using VBOs instead of glBegin and glEnd, but it didn't help as much as the internet promised. My next, not yet implemented idea is to compute all this only when the player's position or rotation changes, and otherwise just render the same set as before without any computation.
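For readers following along, the nearest-first ordering described in this post can be sketched roughly as below. This is only an illustration, not the poster's code: the `Partition` struct, its fields and `sortFrontToBack` are hypothetical names, and the sketch assumes each cube is represented by its centre point.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical partition record: only the cube centre matters for ordering.
struct Partition {
    int id;
    float cx, cy, cz; // centre of the axis-aligned cube
};

// Squared Euclidean distance; the square root is not needed for ordering.
inline float distSq(const Partition& p, float x, float y, float z) {
    const float dx = p.cx - x, dy = p.cy - y, dz = p.cz - z;
    return dx * dx + dy * dy + dz * dz;
}

// Sorts partitions nearest-first relative to the camera position, so that
// near geometry reaches the depth buffer before far geometry is submitted.
inline void sortFrontToBack(std::vector<Partition>& parts,
                            float camX, float camY, float camZ) {
    std::sort(parts.begin(), parts.end(),
              [=](const Partition& a, const Partition& b) {
                  return distSq(a, camX, camY, camZ) < distSq(b, camX, camY, camZ);
              });
}
```

Drawing in this order lets the depth test reject most hidden fragments early, which is the overdraw reduction the post reports.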

 

Anyway, every tick I count the number of triangles actually drawn. As a testing map I use Hell Gate from Quake III (the one with the crazy mouth in a room, if somebody knows it :D ) loaded from an exported .obj file. When I am at the end of the map and looking outside (into the mouth), 36 triangles are drawn, which seems fair to me. However, turning by 180 degrees makes me look INSIDE the map, my frustum contains nearly all partitions, and 22k of the 26k triangles are drawn. Now I get to my idea - draw a few partitions, then CHECK IF ALL PIXELS WERE DRAWN, if not, draw some more partitions, and so on. That could make it really fast, because it would cut off everything except the first room. The problem is that reading back the depth buffer and checking all 1920x1080 of those little guys is so slow that it would be counterproductive (proven by trying).

 

So my question is - is there actually a FAST way to check whether all pixels have been rendered at least once (= the depth value is not 127 anywhere)? I did about 3 hours of research, which didn't turn up an answer. Also, if people here say "no", it will encourage me to write my own rasterizer (at least I will have full control).



#2   Members   


Posted 24 July 2016 - 03:47 PM

Also, for illustration go to 2:17 :

 

https://www.youtube.com/watch?v=qA49I-P6DDQ



#3   Moderators   


Posted 24 July 2016 - 06:16 PM

There's a hardware feature called 'occlusion queries', which does exactly what you're looking for -- it gives a yes/no answer to whether something was drawn or not. To find out if there are "holes" in the depth buffer, you can draw a quad that's very far away using an occlusion query, and check if the result is "yes - the quad was visible".
 

Now I get to my idea - draw a few partitions, then CHECK IF ALL PIXELS WERE DRAWN, if not, draw some more partitions, and so on. That could make it really fast, because it would cut off everything except the first room. The problem is that reading back the depth buffer and checking all 1920x1080 of those little guys is so slow that it would be counterproductive (proven by trying).

 A bigger problem is that the CPU and GPU have a very large latency between them. When you call any glDraw function, the driver is actually writing a command packet into a queue (like networking!), and the GPU might not execute that command until, say, 30ms later. This is perfectly fine in most situations, as the CPU and GPU form a pipeline with huge throughput, but long latency.
e.g. a healthy timeline looks like:

CPU: | Frame 1 | Frame 2 | Frame 3 | ...
GPU: | wait    | Frame 1 | Frame 2 | ...

 
If you ever try to read GPU data back to the CPU during a frame -- e.g. you split your frame into two parts (A/B) with a read-back operation in between them, you end up with a timeline like this:

CPU: | Frame1.A | wait      |Copy| Frame1.B | Frame2.A | wait      |Copy| Frame2.B | Frame3.A | ...
GPU: | wait     | Frame 1.A |Copy| wait     | Frame1.B | Frame 2.A |Copy| wait     | Frame2.B | ...

Now, both the CPU and GPU spend roughly half of the time idle, waiting on the other processor.
If you're going to read back GPU data, you need to wait at least one frame before requesting the results, to avoid causing a pipeline bubble :(
That means that reading back GPU data to use in CPU-driven occlusion culling is a dead-end for performance.
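As a back-of-the-envelope illustration of the two timelines above, the cost of the bubble can be modelled with a few lines of arithmetic. This is a deliberately crude model with made-up function names: it assumes the pipelined case is limited by the slower processor, and that a same-frame read-back makes the CPU and GPU take turns completely.

```cpp
#include <algorithm>
#include <cassert>

// Frame time (ms) when CPU and GPU overlap: while the GPU draws frame N,
// the CPU already builds frame N+1, so the slower side sets the pace.
inline double pipelinedFrameMs(double cpuMs, double gpuMs) {
    return std::max(cpuMs, gpuMs);
}

// Frame time (ms) when the CPU blocks on a same-frame read-back: in this
// simplified model the two processors never overlap, so their times add up,
// plus the cost of the copy itself.
inline double syncedFrameMs(double cpuMs, double gpuMs, double copyMs) {
    assert(cpuMs >= 0.0 && gpuMs >= 0.0 && copyMs >= 0.0);
    return cpuMs + gpuMs + copyMs;
}

// Converts a frame time in milliseconds to frames per second.
inline double fps(double frameMs) { return 1000.0 / frameMs; }
```

With 10 ms of CPU work and 10 ms of GPU work per frame, the pipelined case gives 100 fps and the blocking case 50 fps even if the copy were free, which matches the "roughly half of the time idle" observation.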


Edited by Hodgman, 24 July 2016 - 06:32 PM.


#4   Members   


Posted 25 July 2016 - 02:51 AM

Writing your own rasterizer isn't really going to solve your problem, since you won't be utilizing your GPU at all (or if you use compute, not as efficiently as you could be). Just leave that stuff to the GPU guys, they know what they're doing. :)

 

Anyway, do you really need such precise culling? I mean, are you absolutely sure you're GPU bound? Going into such detail just to cull a few triangles might not be worth it, and could hurt your performance rather than help if you're actually CPU bound since modern GPUs prefer to eat big chunks of data more than they like to issue a draw call for each individual triangle. If you have bounding box culling on your objects, and frustum culling, then I think that's all you'll really need unless you're writing a big AAA title with a very high scene complexity.

 

Just bear in mind that Quake levels were built for some different hardware constraints, so you should probably break up the obj model you have into small sections to avoid processing the entire mesh in one chunk so that you can leverage those two culling systems a little more.

 

That said, if you really want to have some proper occlusion culling for triangles, you can either check out the Frostbite approach (it's quite complicated iirc), or try implementing a simple Hi-Z culling system using Geometry Shaders (build a simple quad-tree out of your zbuffer and do quad-based culling on each triangle using the geometry shader). The latter is simpler to implement and I've had pretty good results with it.
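To make the Hi-Z idea a bit more concrete, here is a CPU-side sketch of the data structure only. It is not the geometry-shader implementation mentioned above: the function names are hypothetical, and it assumes even dimensions and a depth convention where smaller values are nearer to the camera.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Builds the next coarser Hi-Z level: each output texel keeps the FARTHEST
// (largest) depth of the 2x2 block below it. Assumes w and h are even and
// that smaller depth values are nearer to the camera.
inline std::vector<float> downsampleMax(const std::vector<float>& src, int w, int h) {
    assert(w % 2 == 0 && h % 2 == 0);
    assert(src.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    std::vector<float> dst(static_cast<std::size_t>(w / 2) * static_cast<std::size_t>(h / 2));
    for (int y = 0; y < h / 2; ++y) {
        for (int x = 0; x < w / 2; ++x) {
            const float a = src[(2 * y) * w + 2 * x];
            const float b = src[(2 * y) * w + 2 * x + 1];
            const float c = src[(2 * y + 1) * w + 2 * x];
            const float d = src[(2 * y + 1) * w + 2 * x + 1];
            dst[y * (w / 2) + x] = std::max(std::max(a, b), std::max(c, d));
        }
    }
    return dst;
}

// Conservative test against one level: the rectangle [x0,x1] x [y0,y1]
// (inclusive texel coordinates in that level) counts as occluded only if the
// object's nearest depth lies behind the farthest depth stored in every
// texel it covers.
inline bool isOccluded(const std::vector<float>& level, int w,
                       int x0, int y0, int x1, int y1, float nearestDepth) {
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (nearestDepth <= level[y * w + x]) return false;
    return true;
}
```

Repeating `downsampleMax` until a 1x1 level remains gives the quad-tree; a small triangle can then be tested against a single coarse texel instead of every pixel it covers.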


Edited by Styves, 25 July 2016 - 03:50 AM.


#5   Members   


Posted 25 July 2016 - 07:10 AM

So I tried occlusion culling. In principle it works, but guess what. :D

 

It made it slower. Initially I got a framerate of 42. When I test the gl_samples count every 20 partitions I get 38 fps; every 10 partitions, 32 fps :(

Keeping a list of the triangles displayed in the previous frame helped: when I don't move I get 41, but when I move it drops to 32 for sure.

And yes, I think it is worth it, because I see about 5k triangles at most. Having to pass 15k extra triangles to OpenGL is a very noobish performance leak.

Anyway thank you guys.



#6   Moderators   


Posted 25 July 2016 - 08:05 AM

"Thousands of triangles" should not be something that alters the framerate so dramatically. A modern game can draw hundreds of thousands of triangles at 1000fps -- which is one of the reasons that VBO's replaced begin/end (same number of gl function calls required for any number of triangles).

 

You need to profile your game to find out where the time is being spent. Set up a class that reads the high-frequency timer at two points in time, subtracts one reading from the other, and logs the result, and then put instances of this class in any function that you think might be a performance hog.

It's common to do this with a constructor/destructor:

struct ProfileLogger
{
  ProfileLogger(const char* name) { PushProfileScope(name); }
  ~ProfileLogger() { PopProfileScope(); }
};
#define Profile(name) ProfileLogger _profile_(name)
 
void Test()
{
  Profile("Test"); // calls PushProfileScope("Test") here
  DoStuff();
} // calls PopProfileScope() here, when _profile_ goes out of scope

From this data you can get a hierarchical breakdown of where all your CPU time is spent per frame. Trying to optimize without this data is just shooting in the dark.
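The snippet above leaves PushProfileScope and PopProfileScope undefined. One possible way to fill them in is sketched below, assuming std::chrono::steady_clock as the high-frequency timer and a simple global stack. The helper names (`OpenScope`, `ScopeRecord`, `openScopes`, `finishedScopes`, `DumpProfile`) are invented for this sketch and are not from the post, and the globals make it single-threaded only.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical bookkeeping behind PushProfileScope/PopProfileScope.
struct OpenScope {
    std::string name;
    std::chrono::steady_clock::time_point start;
};

struct ScopeRecord {
    std::string name;
    int depth;      // nesting level, 0 = outermost
    double millis;  // elapsed wall-clock time
};

inline std::vector<OpenScope>& openScopes() { static std::vector<OpenScope> s; return s; }
inline std::vector<ScopeRecord>& finishedScopes() { static std::vector<ScopeRecord> r; return r; }

// Starts timing a named scope and pushes it onto the stack of open scopes.
inline void PushProfileScope(const char* name) {
    openScopes().push_back({name, std::chrono::steady_clock::now()});
}

// Stops timing the innermost open scope and records its duration.
inline void PopProfileScope() {
    assert(!openScopes().empty());
    const OpenScope top = openScopes().back();
    openScopes().pop_back();
    const std::chrono::duration<double, std::milli> dt =
        std::chrono::steady_clock::now() - top.start;
    finishedScopes().push_back({top.name, static_cast<int>(openScopes().size()), dt.count()});
}

// RAII wrapper as in the post: constructor pushes, destructor pops.
struct ProfileLogger {
    explicit ProfileLogger(const char* name) { PushProfileScope(name); }
    ~ProfileLogger() { PopProfileScope(); }
};

// Prints one line per finished scope, indented by nesting depth.
inline void DumpProfile() {
    for (const ScopeRecord& r : finishedScopes())
        std::printf("%*s%s: %.3f ms\n", r.depth * 2, "", r.name.c_str(), r.millis);
}
```

Note that records are stored in completion order, so an inner scope appears before the scope that contains it; the stored depth is what allows the hierarchy to be rebuilt afterwards.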

 

From the sounds of it, your game is almost certainly CPU-bound, so you can start here. Later on though, you can use gl timer queries to do the same thing on the GPU side -- wrapping parts of the scene in two timer queries to find out how long it took the GPU to process those commands.



#7   Members   


Posted 26 July 2016 - 01:46 AM

Displaying is the actual bottleneck, because I have done this stuff before. What I didn't know was that on the testing computer the Nvidia driver was set to best antialiasing and texture filtering, so I turned them off and now it runs stably at 60fps without any visible change for the worse :)

 

... anyway, if occlusion queries are so slow, what is their point then? Would, for example, GL_ANY_SAMPLES_PASSED_CONSERVATIVE speed it up? Another idea is lowering the viewport resolution when doing queries and then setting it back to normal ...



#8   Moderators   


Posted 26 July 2016 - 03:18 AM

The problem with this use of occlusion queries is the pipeline bubble caused by reading back results on the same frame. Even if the query itself is free, this bubble will halve your framerate.

 

Reading back an occlusion query is fine if you wait one frame before requesting the result, as this won't disrupt the pipeline. This is useful for things like lens flares or special effects where you don't care about the data being one frame old, but is dangerous for deciding what parts of the scene to draw :(

Generally they're a pretty useless API feature...

 

In a modern engine though, you can move all your culling and "what to draw" code off the CPU and into a compute shader. You can then issue draw-indirect commands to say "you will be drawing something, but I don't know which triangles yet. The number and offset will be present in this buffer later" (the buffer is filled in by the compute shader).

 

For something like a quake3 level though, you should be able to have a constant draw cost regardless of how many triangles are visible. The level triangle data is static, so put it in a VBO once and never update it again.



#9   Members   


Posted 27 July 2016 - 02:38 AM

That Hi-Z culling looks promising, I will keep it as a backup idea for optimization. I don't doubt that the lighting model will fuck up the framerate significantly once I add it, so there will surely be a need for any optimization that works. But since I already have 60fps now and people around me mainly demand basic functionality ("How's it going with the game?" "I made it faster" "And can I play it?" "Not yet"), I must move on to networking and multiplayer ASAP.

 

By the way, I don't think a constant draw cost is good. I could measure the time of the loop and, if some time remains, do some filler work like pre-loading a chunk of the next map into a second map buffer or something :)



#10   Members   


Posted 29 July 2016 - 02:50 AM

Since you're using a Quake 3 map, the classic Quake method of solving this was to precompute a potentially visible set (or PVS) using an offline pre-processor, then do checks against that PVS at runtime to determine what should (or should not) be drawn.

 

In (very basic) outline, you divide your map into what I'll call "areas"; these could be nodes/leafs in a BSP tree (which was what Quake used), rooms, cubes in a grid, whatever.  Then for each such area, you use some brute-force method to determine what other areas are potentially visible from it (I believe that the "potentially" part is on account of some coarseness in the algorithm, as well as the fact that this stage ignores frustum culling which is still done at runtime).  Store the result in some fast and compact data format (Quake used a bitfield array).  Then at runtime you're just looking up those stored results; draw calls, overdraw, etc all go down, the map runs faster, and everybody is happy.
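A minimal sketch of the runtime lookup side of such a table is shown below. It assumes an uncompressed bitfield with one row per area, which is a simplification chosen for illustration rather than Quake's actual on-disk layout; the `PvsTable` class and its method names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Uncompressed potentially-visible-set table: one row of bits per area.
// Bit j of row i is set when area j may be visible from area i.
class PvsTable {
public:
    explicit PvsTable(std::size_t areaCount)
        : count_(areaCount),
          rowBytes_((areaCount + 7) / 8),
          bits_(areaCount * rowBytes_, 0) {}

    // Filled in by the offline pre-processor (or by hand in a map editor).
    void setVisible(std::size_t from, std::size_t to) {
        assert(from < count_ && to < count_);
        bits_[from * rowBytes_ + to / 8] |= static_cast<std::uint8_t>(1u << (to % 8));
    }

    // Runtime lookup: an index, a shift and a mask, no geometry tests.
    bool isVisible(std::size_t from, std::size_t to) const {
        assert(from < count_ && to < count_);
        return ((bits_[from * rowBytes_ + to / 8] >> (to % 8)) & 1u) != 0;
    }

private:
    std::size_t count_;
    std::size_t rowBytes_;
    std::vector<std::uint8_t> bits_;
};
```

At render time the engine would find the area containing the camera, then draw only areas for which `isVisible(cameraArea, other)` is true and which also pass frustum culling.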

 

The downside is that the pre-processing can take time, needs to be re-run even if you make trivial changes to your map, and needs a custom map format to store the data.  And while we're on the subject of formats, .obj is a horrible, horrible, horrible, horrible, horrible format to use for game maps.  The only reason to use it is if you really love writing text parsers.  The ideal format is where you memory-map a file, read some headers to set up some sizes, then glBufferData the rest.  Simple, quick to load, no faffing about.  And while we're on the subject of glBufferData, if your observation is that VBOs are slower than glBegin/glEnd, then you're using them wrong: probably by writing a glBegin/glEnd-alike wrapper around the VBO API.
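The "read some headers, then glBufferData the rest" layout could look roughly like the sketch below. Everything here is an assumption made for illustration: the `MapHeader` and `Vertex` layouts, the magic value and the function names are invented, the blob lives in a std::vector instead of a memory-mapped file, and endianness and versioning are ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical fixed-size header at the start of the map file.
struct MapHeader {
    std::uint32_t magic;        // format tag chosen for this sketch
    std::uint32_t vertexCount;  // number of Vertex records that follow
};

// Hypothetical vertex layout: position plus texture coordinates.
struct Vertex { float x, y, z, u, v; };

constexpr std::uint32_t kMapMagic = 0x3150414Du; // arbitrary value

// Writes header + tightly packed vertex array into one contiguous blob.
inline std::vector<unsigned char> packMap(const std::vector<Vertex>& verts) {
    const MapHeader hdr{kMapMagic, static_cast<std::uint32_t>(verts.size())};
    std::vector<unsigned char> blob(sizeof(hdr) + verts.size() * sizeof(Vertex));
    std::memcpy(blob.data(), &hdr, sizeof(hdr));
    if (!verts.empty())
        std::memcpy(blob.data() + sizeof(hdr), verts.data(), verts.size() * sizeof(Vertex));
    return blob;
}

// Reads the header, then returns a pointer to the vertex block inside the
// blob. Returns nullptr if the blob is too small or the magic is wrong.
inline const unsigned char* vertexBlock(const std::vector<unsigned char>& blob,
                                        MapHeader& hdrOut) {
    if (blob.size() < sizeof(MapHeader)) return nullptr;
    std::memcpy(&hdrOut, blob.data(), sizeof(MapHeader));
    if (hdrOut.magic != kMapMagic) return nullptr;
    if (blob.size() < sizeof(MapHeader) + hdrOut.vertexCount * sizeof(Vertex)) return nullptr;
    return blob.data() + sizeof(MapHeader);
}
```

The pointer returned by `vertexBlock`, together with `vertexCount * sizeof(Vertex)` as the size, is the kind of argument pair that glBufferData takes, so no per-vertex parsing is needed at load time. The data has to be a flat array for this to work; a linked list would be flattened once when the map is saved.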


It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.


#11   Members   


Posted 04 September 2016 - 03:39 AM

mhagain:

 

You misunderstood me - I load from .obj only initially, after I model something in Cinema4D and export it. My engine then has a function that saves the map in my own format, which stores every number as bytes, and the space partitioning is saved in that file too. I don't know exactly what you mean by memory-mapping a file, but from the name it sounds like it would only work for static arrays of classes, and not for a linear linked list of dynamically allocated classes like I have. I like .obj because it's intuitive and readable and the format specification needs no reverse engineering.

 

PVS is a really good idea, I like it. I will definitely try it. I am thinking of making a special class for it: a box area plus a list of other box areas that are potentially visible from it. In the map editor, the user will choose these manually, so there will be no need for extra complicated code that determines what is visible from where (even if THIS is exactly the point where occlusion culling would be useful...) and also no need to recompile the map after every change.



#12   Members   


Posted 08 September 2016 - 11:21 PM

 

My testing map has 15k vertexes and 26k triangles, and I am using

 

You are looking for optimizations in the wrong place if you are getting such low framerates, with or without any texture filtering. This scene is on my old Radeon 7870 video card from 3 years ago:

 

This is 3,000 plants drawn on top of a terrain with anisotropic texture filtering set to highest, or at least medium. So we are talking about 1 million triangles and plenty of overdraw between the plants. FPS was ... I don't remember exactly, around 60 though. You have other issues to sort out depending on how good your hardware is. 30 FPS is bad. 60fps sounds like you may have vsync on so it is capping it at 60. Typically vsync will cap framerates to either 30 or 60, nothing in between. At 56 fps you could be capped down to 30 potentially.
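The jump from 56 to 30 can be reproduced with a little arithmetic. The sketch below assumes plain double-buffered vsync on a fixed-rate display (no triple buffering or adaptive sync), where every frame occupies a whole number of refresh intervals; `vsyncedFps` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>

// Displayed frame rate under plain double-buffered vsync: a finished frame
// waits for the next refresh, so every frame occupies a whole number of
// refresh intervals.
inline double vsyncedFps(double renderMs, double refreshHz) {
    assert(renderMs > 0.0 && refreshHz > 0.0);
    const double intervals = std::ceil(renderMs * refreshHz / 1000.0);
    return refreshHz / intervals;
}
```

A frame that takes 1000/56 ms (about 17.9 ms) misses the 16.7 ms deadline of a 60 Hz display, so it is shown after two intervals and `vsyncedFps(1000.0 / 56.0, 60.0)` gives 30. Under this model the possible rates are 60, 30, 20, 15 and so on.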

 

http://orig04.deviantart.net/7e28/f/2014/271/0/3/desert2_by_dpadam450-d80w64l.jpg


Edited by dpadam450, 08 September 2016 - 11:23 PM.


#13   Members   


Posted 20 September 2016 - 02:41 AM

dpadam450:

 

The computer it was tested on is an Asus X751L with an Nvidia GeForce 820M, Intel Core i7-4500U 1.8GHz, 12GB RAM. I think the hardware should be OK enough.

And yes, I am using vsync, that's why I get 60fps. Previously I heard that 25fps is OK, but the truth is that you subliminally feel something is wrong. I chose 60fps. The texture on the triangles was just 32x32. When I put a 256x256 texture there, the framerate is the same (so there is so far no need for mipmapping).

 

Your image looks nice, I even see you have some shadows. Perhaps I could make a similar benchmark and test on it.

My next idea is to implement some multithreading (theoretically less than a 4x speedup when done correctly). Do you guys use it in your games?



#14   Members   


Posted 20 September 2016 - 05:02 AM

I chose 60fps. The texture on the triangles was just 32x32. When I put a 256x256 texture there, the framerate is the same (so there is so far no need for mipmapping).

 

Any other good reasons for this? Right now you might not see any difference, but later you may notice a performance drop, and you'll spend days (and certainly a lot more) figuring out that it comes from not using mipmapping. One big reason is that you currently use vsync. So as long as your GPU can keep up, you'll have 60 fps, but once it can't, you won't get 59 fps or so, you'll end up with 30 fps...

Mipmaps are just a matter of a few more lines in your code.

Of course, it (almost) doubles the memory requirement for all your images. But unless you really aim to use most of your VRAM for geometry (and other buffers), I don't see any really good reason not to use mipmapping.



#15   Members   


Posted 20 September 2016 - 08:08 AM

Mipmapping is not only a performance thing; it also affects image quality, and in fact the primary reason why mipmapping was invented in the first place was for quality.

 

https://en.wikipedia.org/wiki/Mipmap

 

 

They are intended to increase rendering speed and reduce aliasing artifacts. .....  Mipmapping was invented by Lance Williams in 1983 and is described in his paper Pyramidal parametrics. From the abstract: "This paper advances a 'pyramidal parametric' prefiltering and sampling geometry which minimizes aliasing effects and assures continuity within and between target images." The "pyramid" can be imagined as the set of mipmaps stacked on top of each other.

 

I suggest that you do some research on aliasing to fully understand the problems that this solves.  Also be aware that to some people, aliasing may be confused with additional detail.

 

Mipmaps don't use almost double the memory - they use one-third extra.  But don't get fooled into thinking that memory usage is a primary arbiter of performance, because it's not.
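The one-third figure follows from the geometric series 1 + 1/4 + 1/16 + ... = 4/3: each mip level has a quarter of the texels of the level above it. A small sketch to check the count for a concrete size (the function name is made up, and it assumes the usual halve-and-round-down chain that ends at 1x1):

```cpp
#include <cassert>
#include <cstdint>

// Total texel count of a full mip chain for a w x h texture, halving each
// dimension (rounding down, minimum 1) until the 1x1 level is reached.
inline std::uint64_t mipChainTexels(std::uint32_t w, std::uint32_t h) {
    assert(w > 0 && h > 0);
    std::uint64_t total = 0;
    for (;;) {
        total += static_cast<std::uint64_t>(w) * h;
        if (w == 1 && h == 1) break;
        if (w > 1) w /= 2;
        if (h > 1) h /= 2;
    }
    return total;
}
```

For a 256x256 texture the base level has 65536 texels and the whole chain has 87381, so the extra levels add 21845 texels, almost exactly one third of the base.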




#16   Members   


Posted 21 September 2016 - 12:56 AM

Mipmapping is not only a performance thing; it also affects image quality

 

Right. But one could notice aliasing effects more easily. Aliasing on a textured object probably means a texture issue (or a depth issue, right). But a sudden FPS drop can have many causes.

 

Mipmaps don't use almost double the memory - they use one-third extra.

 

Right too. I'm still wondering how I could reach such a result...



#17   Members   


Posted 29 September 2016 - 01:41 AM

How much faster should a VBO be compared to glBegin/glEnd?

 

Because with glBegin/glEnd I got 41fps and with a VBO about 43fps at the same spot. So yes, it IS faster, but honestly I expected more improvement, since it's one of the main tips people give you for optimizing rendering. The speedup probably varies with the type of stuff rendered, but I don't know where it helps most. Two other problems (minor ones, I can survive them) are the ugly syntax compared to glBegin/glEnd and the fact that when you create a VBO for an object whose geometry changes, the VBO doesn't follow the vertex positions, so you must re-upload the positions that were previously copied into the data buffer.



#18   Members   


Posted 29 September 2016 - 02:24 AM

the VBO doesn't follow the vertex positions, so you must re-upload the positions that were previously copied into the data buffer.

 

The best thing, as far as you can do it, is to let the vertex shader do these calculations.

 

How much faster should a VBO be compared to glBegin/glEnd?

 

It depends on many things. And this might be related to the fact that you update the VBO each frame with new vertex positions.

 

If you want to (and can) live with GL immediate mode, then why not. But be aware that immediate mode has been deprecated since GL 3. For example, your code won't work in the core-profile contexts that Apple machines require for GL 3.2 and later, nor on mobile devices, since OpenGL ES has no glBegin/glEnd.



#19   Members   


Posted 29 September 2016 - 02:48 AM

How much faster should a VBO be compared to glBegin/glEnd?
 
Because with glBegin/glEnd I got 41fps and with a VBO about 43fps at the same spot. So yes, it IS faster, but honestly I expected more improvement...


This is the naive expectation.  As Silence correctly observes, if your VBO implementation is just a glBegin/glEnd-like wrapper around the VBO API, or if you're updating data each frame, glBegin/glEnd will often outperform it.  It's common to see bad VBO usage actually perform worse.  An alternative for the dynamic data requirement is to use client-side vertex arrays.

 

It's also the case that your actual bottleneck may be elsewhere.  GPU pipelines are very deep and using a VBO just addresses performance at one very small part of them.  If you're not actually transferring much data to the GPU (and 15k vertices is not much) then using a VBO, even in the best case scenario, isn't going to give you much perf gain, if any.




#20   Members   


Posted 14 October 2016 - 09:15 AM

Recently I finished the networking and tested it with the first deathmatch ever. But performance on computers other than mine is still one of the big problems. I figured out that when I replace item models with simplified ones I get OK fps, but that's a wimpy solution.

Also - mipmaps have been implemented. Adjustable LOD bias has no effect on performance, even when set so high (like 9) that textures fade into a single color.

It is really slow in glBegin/glEnd when switching textures. Texture sorting helped a bit, but not completely. So I packed all my textures into one big texture, and before compiling a map I will apply a linear fix to the texcoords to map them into their xy ranges in the big texture. The trouble is that they will not repeat or wrap anymore, and I need that. Texcoords outside 0..1 will continue through the other textures in the big map, which is what I don't want. I looked on the Internet for solutions, but found nothing. My idea is to somehow get to the place in memory where the texture is stored and manually hack the width, height and starting pointer, to make it think that it is actually just a region of itself. Mipmaps will complicate this, but that's not the immediate problem. Also, the textures would have to be stored not laid out as I see them in the image, but linearly, row after row, so the program could read them. My dilemma is whether to go for it or not, because maybe a function like that already exists and I just didn't look hard enough. Does anyone know a function that draws only a region of a texture? I care only about those WITH PRESERVED ABILITY TO WRAP OR REPEAT - the answer TexCoord2f(0.5,0.5) really is not what I am looking for :D
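To illustrate the texcoord problem described in this post, the sketch below contrasts the plain linear remap into an atlas region with a remap that first wraps the coordinate inside the tile. All names (`UV`, `AtlasRegion`, `remapLinear`, `remapRepeat`) are hypothetical. The wrapping step only gives correct repeats when it is evaluated per fragment, so in practice this arithmetic would live in a fragment shader rather than in per-vertex glTexCoord2f calls; the C++ here only demonstrates the math.

```cpp
#include <cassert>
#include <cmath>

struct UV { float u, v; };

// Sub-rectangle of the atlas occupied by one original texture, in 0..1 atlas units.
struct AtlasRegion { float u0, v0, width, height; };

// Fractional part in [0, 1), also for negative inputs (like GLSL fract()).
inline float fract(float x) { return x - std::floor(x); }

// Plain linear remap: correct only while the original coords stay in 0..1.
inline UV remapLinear(const AtlasRegion& r, UV t) {
    return {r.u0 + t.u * r.width, r.v0 + t.v * r.height};
}

// Remap with emulated repeat: wrap inside the tile first, then remap.
inline UV remapRepeat(const AtlasRegion& r, UV t) {
    return remapLinear(r, {fract(t.u), fract(t.v)});
}
```

For a tile at u0 = 0.5 with width 0.25, an original coordinate of u = 1.5 lands at 0.875 with the linear remap, which is inside a neighbouring texture, while the wrapped version lands at 0.625, back inside the tile. Texture filtering and mipmaps can still bleed across tile borders with this approach, so it is a starting point rather than a complete solution.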