
## Large Query Count?


12 replies to this topic

### #1 3TATUK2 (Members)

Posted 28 October 2013 - 06:30 PM

So, I'm doing octree occlusion culling, which means I have to generate an occlusion query for each node... which at a depth of 6 is over 16,000...

Once I call glEndQuery() on my 16,256th query, it crashes.
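For scale: a full octree with the root at depth 0 has (8^(d+1) - 1)/7 nodes at depth d, so per-node query counts explode quickly. A minimal sketch of that count (the tree in question is presumably sparse, so its real node count will be lower):

```c
#include <assert.h>
#include <stdint.h>

/* Number of nodes in a full octree of the given depth (root = depth 0):
   1 + 8 + 8^2 + ... + 8^depth = (8^(depth+1) - 1) / 7. */
static uint64_t full_octree_node_count(unsigned depth)
{
    uint64_t total = 0, level = 1;
    for (unsigned d = 0; d <= depth; ++d) {
        total += level; /* nodes at this level */
        level *= 8;     /* each node has 8 children */
    }
    return total;
}
```

At depth 6 this gives 299,593 nodes, which is why issuing a query per node stops scaling.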

### #2 Hodgman (Moderators)

Posted 28 October 2013 - 06:34 PM

If your usage is correct, it sounds like a driver bug. What's the stack trace and the exception given when it crashes?

However:
Does every node in your tree actually contain an occludee? You can probably skip most of them. Also, you should perform frustum culling before occlusion culling, which will also reduce this number.
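Frustum culling before any queries is cheap. A minimal sketch of the standard AABB-vs-frustum rejection test (the "positive vertex" trick), assuming inward-pointing planes; in a real engine the plane values are extracted from the camera's view-projection matrix:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { float x, y, z; } Vec3;
/* Plane with inward-pointing normal: dot(n, p) + d >= 0 means "inside". */
typedef struct { Vec3 n; float d; } Plane;
typedef struct { Vec3 min, max; } Aabb;

/* Conservative test: returns false only when the box is entirely outside
   at least one plane. Boxes that merely intersect still return true. */
static bool aabb_in_frustum(const Aabb *box, const Plane planes[6])
{
    for (int i = 0; i < 6; ++i) {
        const Plane *p = &planes[i];
        /* Pick the box corner furthest along the plane normal. */
        Vec3 v = {
            p->n.x >= 0 ? box->max.x : box->min.x,
            p->n.y >= 0 ? box->max.y : box->min.y,
            p->n.z >= 0 ? box->max.z : box->min.z,
        };
        if (p->n.x * v.x + p->n.y * v.y + p->n.z * v.z + p->d < 0)
            return false; /* even the furthest corner is outside */
    }
    return true;
}
```

Nodes that fail this test never get an occlusion query at all.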

### #3 3TATUK2 (Members)

Posted 28 October 2013 - 06:41 PM

I also suspected it's a driver/hardware bug...

The stack trace comes from the GL driver, so it's basically useless:

```
#0  0x69e30a2c in atioglxx!atiPS () from C:\WINDOWS\SysWOW64\atioglxx.dll
#1  0x814e724c in ?? ()
#2  0x6994c4f8 in atioglxx!atiPPHSN () from C:\WINDOWS\SysWOW64\atioglxx.dll
#3  0x060f2910 in ?? ()
#4  0x6994b3c9 in atioglxx!atiPPHSN () from C:\WINDOWS\SysWOW64\atioglxx.dll
#5  0x064a4188 in ?? ()
#6  0x69088463 in atioglxx!atiPPHSN () from C:\WINDOWS\SysWOW64\atioglxx.dll
#7  0x814e7138 in ?? ()
#8  0x693f0577 in atioglxx!atiPPHSN () from C:\WINDOWS\SysWOW64\atioglxx.dll
#9  0x00000000 in ?? ()
```


Also, it's "hypothetically potential" that all octree nodes will be visible at one time - say you fly up into the air and look down onto the scene, or something... So I can't rely on a "workaround" like that.

### #4 kauna (Members)

Posted 28 October 2013 - 06:42 PM

Is it preferable to use the hardware for the occlusion queries? In several cases it has been stated that a software solution is actually better, although maybe more difficult to implement.

As far as I know, the problem with the hardware occlusion is that you'll have to wait some / several frames for the query results to arrive. Also, SLI/Crossfire requires separate queries created for each AFR unit.

Cheers!

### #5 3TATUK2 (Members)

Posted 28 October 2013 - 06:45 PM

Yeah, pre-generated "potentially visible set" stuff is better but much harder to implement... I might start seriously thinking about it. Also, I use the query result from the last frame, so latency isn't much of a problem.

### #6 mhagain (Members)

Posted 28 October 2013 - 06:55 PM

You shouldn't be doing occlusion on each node.

First of all, use frustum culling to remove nodes that aren't in your view pyramid. Don't even bother testing these for occlusion.

Secondly, the way an octree is constructed, if any given node is occluded then all of its child nodes are also occluded (because a node contains all of its children). There's no need to test the child nodes of an occluded node.

Your hypothetical case where all nodes are visible at the same time - that's a rare edge case. It's not going to happen unless all of your nodes are on the same plane and the viewpoint is sufficiently far away to bring them all on-screen together. I wouldn't even worry about that at all.
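The hierarchical early-outs described above can be sketched as a traversal that skips frustum-culled, empty, and occluded subtrees whole. The node fields here (`in_frustum`, `occluded`, `has_occludees`) are stand-ins for whatever state the engine actually tracks:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Node {
    struct Node *children[8];   /* NULL where no child exists */
    bool in_frustum;            /* result of the frustum test (stubbed) */
    bool occluded;              /* last frame's occlusion-query result */
    bool has_occludees;         /* does this subtree contain anything? */
} Node;

/* Count how many occlusion queries would actually be issued. Frustum-culled
   and empty subtrees get none; an occluded node gets one re-test query and
   its children are skipped entirely, so query count tracks what is near the
   screen, not total node count. */
static int count_queries(const Node *n)
{
    if (n == NULL || !n->in_frustum || !n->has_occludees)
        return 0;
    if (n->occluded)
        return 1; /* re-test the parent only; children stay skipped */
    int total = 1;
    for (int i = 0; i < 8; ++i)
        total += count_queries(n->children[i]);
    return total;
}
```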


### #7 Hodgman (Moderators)

Posted 28 October 2013 - 07:57 PM

Even if all the nodes are visible, you can decide to LOD some of them and use the parent instead of the leaf nodes. If you're not going to do any stuff that exploits the hierarchical nature of the tree (including what mhagain said above), then you've basically just got a 3D grid, not a 3D tree.

A pre-generated PVS system can be extremely simple, though. In one game demo that I made, I simply created bounding volumes for different sectors by hand, and then manually wrote into a text file the lists of which sectors could be seen from which other sectors.
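At runtime, that kind of hand-authored sector list boils down to a per-sector bitmask lookup. A tiny sketch with made-up sector data (the layout here is purely illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hand-authored PVS for a hypothetical 4-sector level: bit j of pvs[i] is
   set when sector j is potentially visible from sector i. Layout assumed
   here: 0 <-> 1 <-> 2 are connected, sector 3 is isolated. */
static const uint32_t pvs[4] = {
    (1u << 0) | (1u << 1),             /* from 0: see 0 and 1 */
    (1u << 0) | (1u << 1) | (1u << 2), /* from 1: see 0, 1, 2 */
    (1u << 1) | (1u << 2),             /* from 2: see 1 and 2 */
    (1u << 3),                         /* from 3: see only itself */
};

/* One bit test per object/sector pair at draw time. */
static bool sector_visible(int from, int target)
{
    return (pvs[from] >> target) & 1u;
}
```

Anything in a sector whose bit is clear for the camera's current sector never reaches the renderer at all.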

As well as pre-generated PVS, there are lots of runtime approaches that use the CPU rather than the GPU:

• Lots of games have used sector/portal systems.
• I know of one proprietary AAA engine that just had the level designers place 3D rectangles throughout the world as occluders (e.g. in the walls of buildings). The engine would then find the ~6 largest of these rectangles that passed the frustum test, and then use them to create frustums, and then brute-force cull any object that was inside of those occluder-frustums.
• Fallout 3 used a combination of both of the above.
• Here's another example of occluder frustums: http://www.gamasutra.com/view/feature/131388/rendering_the_great_outdoors_fast_.php?print=1
• Lots of modern games have implemented software rasterizers, where they render occlusion geometry on the CPU to create a float Z-buffer in main ram, which you can then test object bounding boxes against.
• Personally, I'm using this solution, where you implement a software rasterizer, but only allocate one bit per pixel (1=written to, 0=not yet written to). A clever implementation can then write to (up to) 128 pixels per instruction using SIMD. Occludee and occluder triangles are then sorted from front to back and rasterized. Occluder triangles write 1's, occludee triangles test for 0's (and early exit if a zero is found).
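The core of such a 1-bit coverage buffer is just two operations on a row of bits: occluders set bits, occludees test for any remaining zero and early-exit. A simplified per-pixel sketch (a real implementation would mask whole 64-bit words or SIMD registers at a time, and would rasterize full sorted triangles rather than single spans):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WIDTH 256
#define WORDS (WIDTH / 64)

/* One scanline of the 1-bit coverage buffer: 1 = written, 0 = still open. */
typedef struct { uint64_t bits[WORDS]; } CoverageRow;

/* Occluder spans write 1s over pixel columns [x0, x1). */
static void write_span(CoverageRow *row, int x0, int x1)
{
    for (int x = x0; x < x1; ++x)
        row->bits[x >> 6] |= 1ull << (x & 63);
}

/* Occludee spans test [x0, x1): visible if any pixel is still 0. */
static bool span_visible(const CoverageRow *row, int x0, int x1)
{
    for (int x = x0; x < x1; ++x)
        if (!((row->bits[x >> 6] >> (x & 63)) & 1ull))
            return true; /* found an uncovered pixel: early out */
    return false;
}
```

Because the buffer holds one bit per pixel instead of a float depth, front-to-back sorting replaces the depth compare, and the whole thing fits comfortably in cache.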

On the GPU side, there are also other options besides occlusion queries. E.g. Splinter Cell: Conviction re-implemented the HZB idea, doing culling in a pixel shader, allowing them to do tens of thousands of queries in a fraction of a millisecond on a GPU from 2006 (much, much faster than HW occlusion queries, but slightly less accurate).

> Stack trace comes from GL driver which is basically useless

So as long as your usage of the functions is perfectly compliant with the GL spec, that means you can legitimately blame the driver and/or report a driver bug.

Edited by Hodgman, 28 October 2013 - 08:23 PM.

### #8 Ohforf sake (Members)

Posted 29 October 2013 - 07:51 AM

You have posted a lot of threads, all revolving around occlusion culling. In each you seem to target a problem that isn't actually your problem.

Have you done any profiling or performed any measurements that actually tell you that you need this sort of occlusion culling? You should never do something as diffuse as optimization without some form of measurement in place that tells you whether you are heading in the right direction and whether you are still on track.

> Also, it's "hypothetically potential" that all octree nodes will be visible at one time - say you fly up into the air and look down onto the scene, or something... So I can't rely on a "workaround" like that.

If your game is supposed to render just fine at 30 FPS even when everything is visible, why do you even care about culling?

Don't 16k occlusion queries also mean 16k draw calls? If you hit 16k draw calls before the first real triangle is even rendered, you might just as well be better off without any culling.

Why don't you try some LOD and pure frustum culling and see how far that gets you? If you are indoors rather than outdoors, try portal culling. It is actually quite easy to implement for hand-placed cells/portals and extremely efficient.

### #9 3TATUK2 (Members)

Posted 29 October 2013 - 08:03 PM

> You have posted a lot of threads, all revolving around occlusion culling. In each you seem to target a problem that isn't actually your problem.

Yes? That's what I'm focusing on implementing now, and each post is aimed at a different specific issue. It's your opinion that "they're not my problem".

> Have you done any profiling or performed any measurements that actually tell you that you need this sort of occlusion culling?

Yes. For example, on a 200K-polygon map I would get a constant 40 FPS since all vertices were transformed. With my occlusion culling I get a varying 60-150 FPS throughout the map.

> If your game is supposed to render just fine at 30 FPS even when everything is visible, why do you even care about culling?

I don't care so much that it's supposed to be *fast* in such a hypothetical situation - but that it still *works*.

> Don't 16k occlusion queries also mean 16k draw calls?

No. Once an octree node is not visible, none of its leaves are processed.

> Why don't you try some LOD and pure frustum culling and see how far that gets you?

I tried frustum culling, and my current octree occlusion works better.

> If you are indoors rather than outdoors, try portal culling.

It might be "easy" - but it puts a limitation on the mapping aspect and makes things less flexible.

### #10 Ohforf sake (Members)

Posted 30 October 2013 - 03:45 AM

> > You have posted a lot of threads, all revolving around occlusion culling. In each you seem to target a problem that isn't actually your problem.
>
> Yes? That's what I'm focusing on implementing now, and each post is aimed at a different specific issue. It's your opinion that "they're not my problem".

I didn't mean to imply that you were posting too many threads or asking too many questions. On the contrary, I would encourage you to do so. It is, however, my opinion (and yes, this is only my opinion and you are free to ignore it) that you might be chasing ghosts.

> > Have you done any profiling or performed any measurements that actually tell you that you need this sort of occlusion culling?
>
> Yes. For example, on a 200K-polygon map I would get a constant 40 FPS since all vertices were transformed. With my occlusion culling I get a varying 60-150 FPS throughout the map.

This is a good start, but can you try to get more into the details? Without occlusion culling you get 40 FPS, which equals 25 ms for 200k tris. Is this 25 ms on the CPU or on the GPU, and are you sure that this is vertex transformation time? Can you tilt the camera so that no triangles hit the screen and measure again? The OpenGL time queries are an extremely cool tool for measuring GPU times and latencies.

To give you a reference, my notebook GPU can render a ~700k-triangle terrain in ~1 ms as long as no triangle actually hits the screen. So in terms of pure vertex and polygon processing, it should be able to render 17500k triangles in those 25 ms, not 200k. You will probably never reach the vertex efficiency of a terrain with regular models, but still, with 200k tris in 25 ms I find it hard to believe that you are actually vertex bound.

> > Don't 16k occlusion queries also mean 16k draw calls?
>
> No. Once an octree node is not visible, none of its leaves are processed.

Are you using something like the concept proposed in GPU Gems?
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter06.html

Because they still perform blocking waits on query results in there, which might end up being a PITA once timings start to shift around due to code, content or hardware changes.

> > If you are indoors rather than outdoors, try portal culling.
>
> It might be "easy" - but it puts a limitation on the mapping aspect and makes things less flexible.

We went down the same road in our last commercial game, for the same reasons (reducing the workload for the level designers), and boy did we regret that decision.

### #11 3TATUK2 (Members)

Posted 30 October 2013 - 05:27 AM

Using a time query, I get 8 ms/frame when no tris hit the screen. Maybe your card is just 8x better than mine :> What kind of GPU is that? Mine is a really low-end card, a Radeon HD 5450... Curious, I compared GPU benchmark stats from videocardbenchmark.net... I get a G3D Mark of 234... So I'm "pretty sure" this is vertex bound. After all, all vertices sent through still need to be processed and transformed by the vertex shader before it can even be determined that they're offscreen and the fragment shader can be bypassed.

Yeah, I'm glad that it *works* - but I'm slightly disappointed it doesn't really come close to good engines' optimizations. Like I've mentioned before as a comparison, the DarkPlaces Quake engine that Xonotic runs on gets 200+ FPS with this same exact map on my system. From what I've heard they use a combination of portals and pre-baked PVS.

### #12 Ohforf sake (Members)

Posted 30 October 2013 - 06:05 AM

I have a GeForce 660M which actually is quite a lot faster.
http://www.notebookcheck.net/NVIDIA-GeForce-GTX-660M.71859.0.html

Is that radeon card the mobile version or the desktop version?

8 ms is still a bit slow, but already three times faster than the 25 ms, which I assume you measured on the CPU. This means that you are CPU bound. Unless this is due to some crazy physics or fluid simulation running in the background, your draw calls are too expensive. Occlusion culling does help here (as long as it reduces the draw call count), but you should double-check your rendering code first.

Interesting things to consider: How many draw calls are there per frame (or, on average, how many triangles per draw call), and how many state changes to set them up? What are the hotspots that the profiler reports? How much time is spent in the application, and how much in the OpenGL runtime? Are there any blocking requests that wait on the GPU (these can be very subtle and easy to overlook, like a simple glGetError() or a query that wasn't ready)?

### #13 3TATUK2 (Members)

Posted 30 October 2013 - 06:49 AM

I have the HD 5450 desktop card...

The difference between my 8 ms report and the 25 ms comes down to a few factors. First off, I was doing a depth pre-pass, which basically doubles the vertex transformation time, so that's up to 16 ms right there. Secondly, the non-rendering part of the CPU frame takes about 4 ms, so you're up to 20. And the 25 ms report was based on an older version of my map, which was actually closer to 250K, instead of this latest one, which is lower at about 200K.

So the 25ms is still "accurate" accounting for those factors and still shows that it's GPU bound.

I really think it's just slower than your card and that's all. . .

My octree occlusion implementation has essentially been unchanged for about two or three days now, and I really believe it's probably "as good as it'll get". Pre-baked PVS is "just that much" better.  Like I said I get about 2x better performance on this same map in the xonotic engine . . . Because of PVS.

I've been thinking about how to potentially even start a PVS implementation, and it essentially amounts to something like this:

• For each base octree node (or maybe even each spatial grid cube):
• For each of the 6 camera view frustum directions:
• Render each individual triangle wrapped in an occlusion query.
• If the triangle passes, add it to the list of "visible triangles" for that 1/6 frustum section of that octree node/spatial grid cube.

Then, during runtime: determine which octree node/spatial grid cube the camera is in, intersect the camera frustum with the 6 view-directional frustums to get (I guess) at most 3 potentially visible frustums, and render all triangles in the "visible triangles" lists of the intersected frustums.
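At runtime that scheme reduces to unioning the per-direction triangle lists for the camera's cell. A minimal sketch with hypothetical baked data, using bitmasks in place of real triangle lists (all values here are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CELLS 2
#define NUM_DIRS  6   /* the 6 view-frustum direction sectors per cell */

/* Hypothetical baked data: bit t of pvs[c][d] is set when triangle t was
   seen from cell c looking along direction sector d in the offline pass. */
static const uint32_t pvs[NUM_CELLS][NUM_DIRS] = {
    { 0x3, 0x4, 0x0, 0x0, 0x8, 0x0 },
    { 0x1, 0x0, 0x2, 0x0, 0x0, 0x0 },
};

/* Runtime: union the lists of every direction sector the camera frustum
   intersects (at most 3 of the 6, per the scheme above). */
static uint32_t visible_triangles(int cell, const int dirs[], int ndirs)
{
    uint32_t mask = 0;
    for (int i = 0; i < ndirs; ++i)
        mask |= pvs[cell][dirs[i]];
    return mask; /* set bits name the triangles to render */
}
```

The per-frame cost is then a cell lookup plus a couple of bitwise ORs, with all the occlusion-query work paid once offline.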
