One Billion Polys by Krypt0n

 



Time Spent: some months, mostly optimization
Date Added: Jun 13 2011 05:08 PM

ok ok, quite a catchy name. what you actually see is my culling lib, supporting

-frustum culling


-occlusion culling

you see a cube consisting of 135k objects, slightly over one billion polys made of 3 mesh types

-utah teapot, about 500poly, mostly contributing to drawcalls

-stanford bunny, 15k poly, simulating today's game models

-tetrahedron, highly tessellated, simulating micro polygons (which are quite challenging for software rasterizers)

some benchmarks, without/with the lib

- 135k objects, 1 billion polys -> 0.3fps/5-6fps

- 13k objects, 100MPoly -> 3fps/30fps

- 1.3k objects, 10MPoly -> 30fps/65fps

- 130 objects, 1MPoly -> 111fps/105fps (about 3-5 objects get culled, kinda useless lib in this case, creating just overhead)
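To put the numbers above on one scale, here is a small sketch that turns each without/with pair into a speedup factor and a per-frame time difference. The 5.5 fps value is an assumption (the midpoint of the quoted 5-6fps range); everything else is copied from the list.

```python
# fps without / with the culling lib, copied from the benchmark list
# (5.5 is an assumed midpoint of the quoted 5-6 fps)
benchmarks = [
    ("135k objects, 1B polys", 0.3, 5.5),
    ("13k objects, 100M polys", 3.0, 30.0),
    ("1.3k objects, 10M polys", 30.0, 65.0),
    ("130 objects, 1M polys", 111.0, 105.0),
]

def frame_ms(fps):
    """Frame time in milliseconds for a given frame rate."""
    return 1000.0 / fps

def speedup(fps_without, fps_with):
    """Ratio > 1 means the lib helps, < 1 means it only adds overhead."""
    return fps_with / fps_without

for name, off, on in benchmarks:
    saved = frame_ms(off) - frame_ms(on)
    print(f"{name}: x{speedup(off, on):.2f}, {saved:+.2f} ms per frame")
```

The last row comes out below 1.0, which is the "creating just overhead" case: roughly half a millisecond per frame is lost when almost nothing can be culled.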




there are just a few occlusion libs, I wanted to create an alternative that has some quality goals:

- handling high object counts, that's usually the limiting factor on PC games using D3D nowadays

- drop in solution, handling all culling you need, without much coding or level setup. supporting ogl/d3d (you can just call glget.. to get matrices and pass them to the lib)

- handling low poly (boxes) to Million poly meshes, with a win if culling is possible, but without performance penalty if it's impossible.

- simple c interface, few functions, supporting "scenes" and "cameras" with shared meshes, but instanced objects. so you can make one-cam, one-scene games with thousands of objects, but also split screen games with shared cams, you can use the cams for occlusion culling of shadows, or in some rare cases also simulate separate scenes.
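As an illustration of why passing the matrices is enough for frustum culling, here is a minimal sketch of the well-known plane extraction from a combined view-projection matrix, followed by a bounding-sphere test. It is not the lib's interface: `frustum_planes` and `sphere_in_frustum` are hypothetical names, and the sketch assumes a row-major matrix applied as clip = m * (x, y, z, 1) with OpenGL clip space. Matrices fetched with glGet come back column-major, so they would need transposing for this layout.

```python
import math

def frustum_planes(m):
    """Return 6 normalized planes (a, b, c, d) from a 4x4 matrix m (list of rows)
    that maps world space to OpenGL clip space as clip = m * (x, y, z, 1).
    A point is inside a plane when a*x + b*y + c*z + d >= 0."""
    r0, r1, r2, r3 = m
    raw = []
    for row in (r0, r1, r2):
        raw.append([r3[i] + row[i] for i in range(4)])  # left / bottom / near
        raw.append([r3[i] - row[i] for i in range(4)])  # right / top / far
    planes = []
    for a, b, c, d in raw:
        length = math.sqrt(a * a + b * b + c * c)
        planes.append((a / length, b / length, c / length, d / length))
    return planes

def sphere_in_frustum(planes, center, radius):
    """Conservative test: False only if the sphere is fully behind one plane."""
    x, y, z = center
    return all(a * x + b * y + c * z + d >= -radius for a, b, c, d in planes)

# identity matrix: the frustum is the clip-space cube [-1, 1]^3
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
planes = frustum_planes(identity)
print(sphere_in_frustum(planes, (0.0, 0.0, 0.0), 0.5))
print(sphere_in_frustum(planes, (3.0, 0.0, 0.0), 0.5))
```

The test is conservative: a sphere near a frustum corner can be reported as visible even though it is outside, which costs a drawcall but never hides a visible object.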




the scene from the screenshot is quite a worst case for the culling lib: high poly and extreme drawcall counts, low pixel cost (opengl phong shading), and it still is a win or at least not a big hit. More common in today's games is a rather high pixel cost with 1000-5000 drawcalls, which makes this tech very beneficial for real world scenarios.







if you're interested in integrating this lib into your existing project, drop me a pm please













 
SSE(1/2/4) was the most important "tool"

it's written internally in c++, using visual studio on win, xcode on mac and vi+make on linux.






18 Comments

This is amazing to me though that may be cause I haven't messed with 3D in a while. It reminds me of one of the Silicon Graphics workstation demos from (I think) around 2000 which had a 3D oil rig model with (IIRC) around 2 billion polys.
Pretty cool!

What methods are you using for the occlusion culling?
For once technology has reached the point where we are able to make a cube out of 135,000 teapots. :)

Seriously, awesome work! What are you planning to do from here?
Game needs to be made purely out of teapots.
thanks guys :)




@Hodgman

it's a software rasterizer with triangle and quad support and some obvious optimizations like removing duplicated vertices etc.

in general the lib is working like a physics/collision lib, the camera is collided against the objects in the broad phase and the fine grain phase is checking objects against the software zbuffer.
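A minimal sketch of what the fine grain phase could look like, assuming a tiny depth buffer and axis-aligned screen rects instead of the rasterized triangles and quads the lib actually uses. All names here are made up for the example.

```python
FAR = 1.0  # depth of an empty pixel, smaller values are closer to the camera

def make_zbuffer(width, height):
    return [[FAR] * width for _ in range(height)]

def draw_occluder(zbuffer, x0, y0, x1, y1, z):
    """Write a screen-space rect (inclusive bounds) at constant depth z."""
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            zbuffer[y][x] = min(zbuffer[y][x], z)

def rect_visible(zbuffer, x0, y0, x1, y1, nearest_z):
    """Test an object's screen rect, using its nearest depth, against the buffer.
    Returns on the first pixel that passes, so visible objects are cheap."""
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            if nearest_z < zbuffer[y][x]:
                return True
    return False

zbuf = make_zbuffer(8, 8)
draw_occluder(zbuf, 2, 2, 5, 5, 0.3)           # a wall in the middle of the screen
print(rect_visible(zbuf, 3, 3, 4, 4, 0.6))     # fully behind the wall
print(rect_visible(zbuf, 4, 4, 6, 6, 0.6))     # sticks out past the wall
```

Testing with the nearest depth of the object keeps the result conservative: an object is only rejected when every pixel of its rect is covered by something closer.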




@owl @Tachikoma

can't do anything wrong with teapots, everyone loves them, right? :) actually I was thinking of using just teapots, dynamically tessellated with Catmull-Clark, but that would kinda move the whole scope of this demo, maybe the next IOTD




@owl

I hoped some people would want to use the lib for their project (and I got some pms which seem promising), as it is quite a basic thing for most engines, but people seem to struggle (or at least waste time) to get a well working system.

If it really will be used, I'll optimize it further, to make it a real alternative to portal, pvs and other systems.

then I'll add more optimization and try to make it more of a 'drop in' solution. I've also designed everything in a way that it would be easy to get it working on PS3/X360, in case someone wants that.




Surprisingly I got requests from non-graphics programmers, who said they might have use for it for AI (visibility checking seems to be a hassle as it's done with raycasts?) and a sound programmer said, with cubemap rendering, he could use it to check the occlusion of sound sources. (I guess that's easy to add).







I have implemented some software rasterizer based occlusion culling on my own, but yours is most impressive!

I wonder how it compares to modern hardware approaches like CHC++. What do you think?
Which version of DX does this work for? dx9 or dx11?


because i made 30K objects run at 100 fps with a GTX 280.
and i'm using DX11
@Tordin how many polygons?

Awesome, keep us posted!
@zapmya

From my point of view, occlusion culling is, just like frustum culling, a way to trade some CPU cycles for GPU cycles. Back in the early days of 3D, frustum culling on cpu was a real time consumer (I think on the first game I made it was 20%+ of the frame time), probably more than occlusion culling nowadays, as games usually don't occupy 100% of all cores, but with just one core and culling down to polygon level, you saw the impact of frustum culling.

That's why I prefer the cpu solution in general. if you were writing a molecular simulator, occupying 100% cpu time and showing some ogl spheres, I wouldn't recommend the software version.

the GPU solutions have quite some issues from my experience

Solution 1: Every drawcall is predicated on a bounding box and, based on its visibility, the actual drawcall is executed or skipped by the hardware.

usage:

-For frustum culling, works on PSP

- occlusion culling on newer consoles

problems:

- 1. the commandbuffer has a stream of states that rely on previous settings, you can skip the actual drawcalls, but you have to process all states. and you might generate "bubbles" in the GPU pipeline: the GPU can be processing a lot of commandbuffer, setting shaders, constants, states etc. but skipping all drawcalls. the ALUs etc. will stay idle and you might get commandbuffer bound, which is something you really don't want. (with frustum culling on PSP it can be 80% of the scene that you reject, if you move frustum culling to GPU)

- 2. you might have quite a lot of CPU overhead to set up all that (e.g. skinning) although the HW will just ignore the drawcalls (that would be especially bad on PC APIs)

- 3. for occlusion culling you might need to sort objects front to back, but that might lead to a lot of unnecessary state switches which might become the bottleneck, as you'd usually sort by state.
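The conflict between the two sort orders in the last point can be made concrete with a toy draw list. The objects, state names and depths below are invented for the example; only the counting idea matters.

```python
def count_state_switches(draws):
    """Number of times the render state changes between consecutive drawcalls."""
    return sum(1 for prev, cur in zip(draws, draws[1:])
               if prev["state"] != cur["state"])

# made-up scene: two materials interleaved in depth
draws = [
    {"name": "a", "state": "brick", "depth": 1.0},
    {"name": "b", "state": "metal", "depth": 2.0},
    {"name": "c", "state": "brick", "depth": 3.0},
    {"name": "d", "state": "metal", "depth": 4.0},
]

by_depth = sorted(draws, key=lambda d: d["depth"])               # best for occlusion
by_state = sorted(draws, key=lambda d: (d["state"], d["depth"]))  # best for state cost

print(count_state_switches(by_depth), count_state_switches(by_state))
```

Front to back order gives the occlusion test the best occluders first but switches state on every draw here, while state order needs a single switch and loses the depth ordering across materials.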


Solution 2: clustering objects.

usage:

- you start a query, draw an AABB of a cluster of objects, get the result next frame. works on most PC APIs

problems:

- clustering: it's not trivial to decide what objects to cluster. I saw some empirical models that generate clusters on developer machines while they are testing and submit them to some central PC, and then those clusters are used like a PVS, but with dynamic occlusion queries. that makes me wonder why not use a PVS in that case, it would be more deterministic and simpler to implement. it works just with statics in that case anyway.

- "ping pong" effects: clustering a big amount of objects leads to a cheap test, but if just ONE pixel is visible, you draw all objects. as the testing has a latency, just splitting your cluster into smaller clusters that you want to test can lead to several frames of latency until something is visible. not only would that result in ugly popping of whole chunks of the level, you'd also miss a lot of occluders, which would lead to a lot of drawcalls that will be detected as "hidden" in some of the next frames. so, usually there is no hierarchy and you switch between those two states where all or nothing is visible. with clusters it's common that quite some area is empty, you have no tight volumes around objects, so it might lead to ping-pong every frame and to stuttering frames although visually nothing changes. not fun to debug that.

- fillrate, when I was implementing this on my geforce 3 back then, I really saved quite some drawcalls and especially triangles/vertices, but I got fillrate limited. GPUs won't stop drawing tons of boxes just because they are visible, they will finish the whole job and give you the pixel count, this can be a penalty.
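The "quite some area is empty" remark about cluster volumes can be checked with a few lines. This sketch builds the AABB around a cluster and reports how much of it the objects actually fill; boxes are ((minx, miny, minz), (maxx, maxy, maxz)) tuples, and the two example boxes are invented.

```python
def union_aabb(boxes):
    """Smallest axis-aligned box that contains all given boxes."""
    mins = tuple(min(b[0][i] for b in boxes) for i in range(3))
    maxs = tuple(max(b[1][i] for b in boxes) for i in range(3))
    return mins, maxs

def volume(box):
    (x0, y0, z0), (x1, y1, z1) = box
    return (x1 - x0) * (y1 - y0) * (z1 - z0)

def fill_ratio(boxes):
    """Share of the cluster box occupied by object boxes (assumes no overlap)."""
    return sum(volume(b) for b in boxes) / volume(union_aabb(boxes))

# two unit cubes at opposite ends of a corridor
cluster = [((0, 0, 0), (1, 1, 1)), ((9, 0, 0), (10, 1, 1))]
print(union_aabb(cluster), fill_ratio(cluster))
```

A low fill ratio means the query box covers mostly empty space, so a single visible pixel of that empty space is enough to draw the whole cluster.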


Solution 3: one object, one query, use the result for the next frame

usage: I think I saw something like that in UE3

problems:

- a query can cost something, some hardware has a limitation in the amount of queries it can buffer in a frame and also the amount of queries that can be 'in flight' in the pipeline. having a lot of cheap objects separated by queries might make you hit this limit. (and that is really an arse of an issue, if you cull more, it will be faster, cull less, everything will be slower, removing culling makes it even slower -> everything is fine? just till some GPU vendor writes you a mail how stupid you are ;) ). but I think the UE3 guys use this only for big chunks of the level, few drawcalls, it seems to be fine.

- similar to the ping pong problem: in one frame you might detect that something is occluded, so you don't draw it this frame. but this frame it might not be occluded, and as you didn't draw it, it also does not occlude anything behind it -> you draw everything the next frame. in UE3 you can see it e.g. in "Shangri La" (part of the demo), there is a fence and you can strafe left/right as spectator, the objects behind the fence will flicker.
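The feedback loop in the last point can be reproduced with a toy model. This is a sketch under strong assumptions: the screen is a row of columns, every object has one depth, and each frame draws exactly what the previous frame's queries reported as visible. It is not how UE3 or any real engine implements queries.

```python
INF = float("inf")

def simulate(frames, depths):
    """frames: per-frame dict name -> set of screen columns the object covers.
    depths: dict name -> depth (smaller is closer).
    Returns the sorted list of drawn object names for each frame."""
    visible_last = set(depths)          # first frame: nothing known, draw everything
    history = []
    for coverage in frames:
        drawn = sorted(visible_last)
        history.append(drawn)
        zbuf = {}                       # depth buffer built from drawn objects only
        for name in drawn:
            for col in coverage[name]:
                zbuf[col] = min(zbuf.get(col, INF), depths[name])
        # queries for ALL objects, results are used in the next frame
        visible_last = {
            name for name in depths
            if any(depths[name] <= zbuf.get(col, INF) for col in coverage[name])
        }
    return history

depths = {"wall": 1.0, "fence": 2.0, "crate": 3.0}
hidden = {"wall": {0, 1, 2, 3}, "fence": {2, 3}, "crate": {2, 3}}
strafed = {"wall": {0, 1}, "fence": {2, 3}, "crate": {2, 3}}  # camera moved sideways
for drawn in simulate([hidden, strafed, strafed, strafed], depths):
    print(drawn)
```

After the strafe, the fence is missing for one frame (popping), and because it was not drawn the crate behind it passes its query and is drawn one frame later, before the result settles on wall plus fence.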





Solution 4: D3D11, using unordered buffers and Draw Indirect

usage: just a crazy idea I had, I think nobody uses it yet. You draw the simple meshes that you would usually tessellate in a "prepass", each writing out its drawcall ID to the framebuffer. Next pass you would use a kernel to set bits in an unordered buffer, based on the drawcall ID. The 3rd pass would create a buffer that is passed to draw indexed instanced indirect, based on the 0s and 1s set in the unordered buffer.

pro/contra: I have no experience with that, unordered buffers seem to work quite fine in general.

there are quite some smart D3D11 users here who might want to give it a try, no idea if it will work out at all ;)
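The data flow of that idea can be sketched on the CPU, with plain lists standing in for the framebuffer and the unordered buffer. This only shows the three steps (ID buffer, visibility bits, compacted draw list); it makes no claim about how well it would run on a GPU, and the function names are invented.

```python
def visible_bits(id_buffer, num_draws):
    """Pass 2: set a bit for every drawcall ID that survived the prepass depth test."""
    bits = [0] * num_draws
    for row in id_buffer:
        for draw_id in row:
            if draw_id is not None:     # None marks a background pixel
                bits[draw_id] = 1
    return bits

def compact(bits):
    """Pass 3: list of drawcall IDs that would be written to the indirect buffer."""
    return [i for i, b in enumerate(bits) if b]

# result of a prepass over 5 drawcalls on a 4x3 framebuffer;
# drawcalls 1 and 4 never won the depth test, so their IDs do not appear
id_buffer = [
    [0, 0, 2, None],
    [0, 3, 2, None],
    [3, 3, 2, 2],
]
bits = visible_bits(id_buffer, 5)
print(bits, compact(bits))
```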





@Tordin

I accept the challenge ;), if I draw just one triangle per object, I get 49fps with 135000 drawcalls per frame :) . Then we start to be API/Kernel limited I guess, unless we want to use instancing. Btw, my benchmarks are from the OpenGL demo, but I have a D3D9 version running as well.

If you intend to compare something more, you must specify more accurately how the scene and camera are set up, so I can provide you numbers.










there is no preview button here :/

good work, keep it up!

I think in this context, this year's GDC demo might also be interesting for you to watch:

Mega Meshes - Modelling, rendering and lighting a world made of 100 billion polygons


http://miciwan.com/GDC2011/GDC2011_Mega_Meshes.pdf
@TiagoCosta : Around 15M i believe.


@Krypt0n : I didn't mean to compare "our qualities" in any way. i was more interested in whether you were using a DX9 version or a DX11 version.
So both of us could improve the rendering.


And besides, a note to everyone: i don't believe the actual drawing in this case is the hard stuff, i believe the OC and FC is the part that takes more from the FPS. (that's only what i believe)
Forgot to add that i have 30-100 drawcalls. i'm using instancing.
@Krypt0n

First, again, congratulations on this work! Looks like a really nice piece of tech you got here!

I would just like to know what the overall algo behind the library is. Some people already asked some of my questions but I still have a few:

- Is it meant to give an answer about visibility of each object for the current frame? Are there some computations that are delayed between frames? (as in CHC++)

- If I understand well, you are first rendering a depth buffer. Then you again render each object to test per pixel visibility? Or are you keeping a list of pixel to test for each objects? (ok this would be a very memory consuming bad idea...) Or any other tricks?

- If there is rasterization, there is a buffer. What is the size of the buffer for the performance you are giving us? Is it the same size as the render buffer (pixel level visibility)? Can you use a buffer with different sizes? (conservative rasterization)



Thank you in advance for your answers! :)
Hi Seb :)

- Is it meant to give an answer about visibility of each object for the current frame? Are there some computations that are delayed between frames? (as in CHC++)


it's made to be fast enough to give you the result for the current frame, it's not "instant" though. I think most people will submit the current state and then want the result to render. But the interface can also work asynchronously, so you could in theory

-change state of instances (e.g. transformations) in the culling lib

-notify the lib to start culling

-do something in-between, like calculating bone/skinning data, handle some streaming etc.

-sync to the culling lib

-render returned objects
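The steps above can be sketched with a worker thread standing in for the culling lib. `cull` and `render_frame` are hypothetical stand-ins, not the lib's functions, and the "culling" is a trivial position check.

```python
from concurrent.futures import ThreadPoolExecutor

def cull(instances, camera_x):
    """Stand-in culling job: keep instances at or to the right of the camera."""
    return [name for name, x in instances.items() if x >= camera_x]

def render_frame(pool, instances, camera_x):
    # change state of instances + notify the lib to start culling
    job = pool.submit(cull, dict(instances), camera_x)
    # do something in-between while the culling job runs
    skinning = {name: "bones ready" for name in instances}
    # sync to the culling lib
    visible = job.result()
    # render returned objects
    return [(name, skinning[name]) for name in visible]

with ThreadPoolExecutor(max_workers=1) as pool:
    print(render_frame(pool, {"a": -1, "b": 2}, camera_x=0))
```

Passing a copy of the instance state to the job mirrors the idea that the lib works on a submitted snapshot while the game keeps going.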

to make it async across frames, you would just reschedule this

- sync

-get list of objects

-change state of instances (e.g. transformations) in the culling lib for the next frame

-notify culling lib

-process all returned objects

-swapbuffers to next frame

-update/logic etc.

- sync

the 2nd solution runs the culling asynchronously. it has no object popping like the usual occlusion queries that are based on the last frame, but it adds one frame of latency.
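A sketch of the rescheduled loop, again with a worker thread and a trivial `cull` function as stand-ins (both are assumptions for the example). Each frame first collects the result requested one step earlier, then immediately submits the state for the next frame before processing what came back.

```python
from concurrent.futures import ThreadPoolExecutor

def cull(instances, camera_x):
    """Stand-in culling job: keep instances at or to the right of the camera."""
    return [name for name, x in instances.items() if x >= camera_x]

def run_frames(pool, instances, camera_path):
    """Return the visible list processed in each frame of camera_path."""
    rendered = []
    job = pool.submit(cull, dict(instances), camera_path[0])    # prime the pipeline
    for next_camera in camera_path[1:]:
        visible = job.result()                                   # sync, get list of objects
        job = pool.submit(cull, dict(instances), next_camera)    # update state, notify lib
        rendered.append(visible)                                 # process returned objects
    rendered.append(job.result())                                # drain the last job
    return rendered

with ThreadPoolExecutor(max_workers=1) as pool:
    print(run_frames(pool, {"a": 0, "b": 5, "c": 10}, [0, 4, 8]))
```

Every list is exact for the camera state it was computed from, which is why nothing pops, but that state was submitted before the previous frame was processed, which is the extra frame of latency.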

- If I understand well, you are first rendering a depth buffer. Then you again render each object to test per pixel visibility? Or are you keeping a list of pixel to test for each objects? (ok this would be a very memory consuming bad idea...) Or any other tricks?

I don't have any separate buffer, although it wouldn't be that bad memory wise, doubling the framebuffer memory. in usual cases you could even assume there won't be more than 64k objects and limit it to 16bit IDs. just the matrices for 135k objects are >8MB, so
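The memory figures can be checked with a bit of arithmetic. The matrix layout (4x4, 32-bit floats) and the 1920x1080 resolution are assumptions for the example; the object count and the 16-bit ID idea are from the answer above.

```python
objects = 135_000
matrix_bytes = 4 * 4 * 4                    # assumed: 4x4 matrix of 32-bit floats
matrices_mb = objects * matrix_bytes / 1e6

width, height = 1920, 1080                  # assumed resolution, not from the post
id_buffer_mb = width * height * 2 / 1e6     # one 16-bit object ID per pixel

max_ids_16bit = 2 ** 16

print(f"matrices: {matrices_mb:.2f} MB")
print(f"16-bit ID buffer: {id_buffer_mb:.2f} MB")
print(f"IDs available: {max_ids_16bit}, enough for this demo: {objects <= max_ids_16bit}")
```

Under these assumptions the matrices alone take about 8.6 MB, more than twice the ID buffer. Note that the 135k objects of this demo would not fit into 16-bit IDs, so the 64k limit only suits the usual cases mentioned.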

If there is rasterization, there is a buffer. What is the size of the buffer for the performance you are giving us? Is it the same size as the render buffer (pixel level visibility)? Can you use a buffer with different sizes? (conservative rasterization)

I try to balance the resolution based on the time between triggering the culling and sync time. the lowest limit is 128x128 atm, but I'm thinking about some way to set the 'culling quality'.

most software occlusion cullers work on lower res, as just testing wouldn't be fast enough due to "fillrate" if done in full res. due to this you can of course suffer from some falsely detected occlusions. there are some ways to compensate for it a little, like extending the size of the primitives that you test, but there can always be cases, e.g. some fence that looks solid in your occlusion culler and hides everything behind it, but in reality has holes, so the objects behind will disappear/flicker.

that's one disadvantage of the software solution, I admit ;)
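One simple form of "extending the size of the primitives that you test" is to round the tested rect outward when mapping it from render resolution to the smaller culling buffer. This is a sketch of that idea only, with an assumed 1280x720 render target and the 128x128 lower limit mentioned above; it is not taken from the lib.

```python
def to_lowres_rect(rect, full_size, low_size):
    """Map a full-res pixel rect (x0, y0, x1, y1), max exclusive, to low-res cells.
    Min edges round down and max edges round up, so the tested area never shrinks."""
    x0, y0, x1, y1 = rect
    full_w, full_h = full_size
    low_w, low_h = low_size
    lx0 = (x0 * low_w) // full_w
    ly0 = (y0 * low_h) // full_h
    lx1 = -((-x1 * low_w) // full_w)    # ceiling division
    ly1 = -((-y1 * low_h) // full_h)
    return lx0, ly0, min(lx1, low_w), min(ly1, low_h)

print(to_lowres_rect((15, 10, 26, 20), (1280, 720), (128, 128)))
```

Rounding outward only protects the tested objects. An occluder with holes, like the fence, is still drawn solid at low resolution, so this does not remove the false occlusions described above.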





If there is rasterization, there is a buffer. What is the size of the buffer for the performance you are giving us? Is it the same size as the render buffer (pixel level visibility)? Can you use a buffer with different sizes? (conservative rasterization)

I try to balance the resolution based on the time between triggering the culling and sync time. the lowest limit is 128x128 atm, but I'm thinking about some way to set the 'culling quality'.


That was one question I also had on my mind. "Culling quality" seems to be a nice concept to adjust the buffer's resolution but is it a good idea to do this automatically? Doesn't this introduce popping/flickering? I am thinking of a scene where the camera is right in front of the above mentioned fence and from time to time some objects like birds or planes are appearing. Now would these extra objects change the buffer's resolution and thereby trigger culling of some previously visible objects?

Some more questions are keeping me busy :)

Does "camera is collided against objects" mean you are using ray casting in the broad phase?

Are you taking advantage of spatial coherency, e.g. using BVHs to cull a bunch of objects?

Are you utilizing temporal coherency? Something like "What is hidden this frame will also most probably be invisible in the next frame"

Sorry for asking so many questions but I am really into this kind of stuff and obviously you are the right person to ask ;-)

Thanks in advance
hi :)

the lower resolution is already a source of error, so adapting it to performance rather reduces the error. otherwise you'd always run the fastest setup and always have the error. but in general the error isn't that noticeable, some games use that kind of occlusion culling (like Warhawk, Crysis 1, Battlefield 3) and it seems to run fine.




Does "camera is collided against objects" mean you are using ray casting in the broad phase?

a broad phase in physics usually means that you put everything into bounding objects (e.g. bounding boxes or spheres or simple primitives like capsule..) and just test those bounding volumes against each other to receive a list of potential collisions, so no, I don't do any raycasting for collisions/visibility.
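A minimal sketch of such a broad phase, assuming the camera is represented by an axis-aligned box around its view volume and every object by its own AABB. How the lib actually represents the camera is not stated, so this is only the generic physics-style test.

```python
def aabb_overlap(a, b):
    """Boxes are ((minx, miny, minz), (maxx, maxy, maxz)); touching counts as overlap."""
    (amin, amax), (bmin, bmax) = a, b
    return all(amin[i] <= bmax[i] and bmin[i] <= amax[i] for i in range(3))

def broad_phase(camera_box, object_boxes):
    """Return names of objects whose bounds overlap the camera volume."""
    return [name for name, box in object_boxes.items()
            if aabb_overlap(camera_box, box)]

camera_box = ((0, 0, 0), (10, 10, 10))          # made-up bounds of the view volume
object_boxes = {
    "teapot": ((2, 2, 2), (3, 3, 3)),           # inside
    "bunny": ((9, 9, 9), (12, 12, 12)),         # partially inside
    "tetra": ((20, 0, 0), (21, 1, 1)),          # outside
}
print(broad_phase(camera_box, object_boxes))
```

Only the survivors of this cheap test would go on to the more expensive check against the software zbuffer.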

Are you taking advantage of spatial coherency, e.g. using BVHs to cull a bunch of objects?

yes, there is some spatial partitioning going on, as I want it to be a generic solution and in the usual world, you see probably less than 1% of all the objects that people place in the world.

Are you utilizing temporal coherency? Something like "What is hidden this frame will also most probably be invisible in the next frame"


Short: NO, I don't.

Long: there are two possibilities, coherency for how long ago something was checked and visible and how long ago it was checked and invisible.

Invisible:

- in that case you would gamble that something won't be visible. in a recent shooter (fps) I was involved in, we tried this at first. it worked in most cases (I'd guess 99.9%), but if some QA tester went into cover and jumped out of it, most of the world was usually invisible and popped in after some frames. even worse, because there were no proper occluders (as they were assumed to be invisible), usually occluded objects were marked visible and were pushed into the streaming system, which rejected objects that actually should be visible in order to stream in invisible objects. This coherency was really not acceptable.

Visible:

- if objects are visible, you could assume that on average you get a "true" after testing 50% of the pixels. if you make it a little bit smarter, you'll actually be done very early if something is visible. in my lib, I think it's about 2% of the time that is wasted on visible objects, 98% on invisible objects (% in regard to cpu time, using some common profiling tool). so we are talking about speeding up 2% in the best case. On the other side, those tests are relatively cheap compared to the rendering of objects: testing every frame and rejecting as many drawcalls as possible, even if you spend those extra 2%, will probably save you more than 2% of the frame time.

that's why I don't use any coherency.
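The "2% in the best case" estimate follows from a simple bound: if a fraction of the culling time disappeared completely, the remaining work limits the speedup. A short check, using the 2% share quoted above:

```python
def max_speedup(fraction_removed):
    """Upper bound on speedup if that fraction of the time vanished entirely."""
    return 1.0 / (1.0 - fraction_removed)

visible_share = 0.02    # share of culling time spent on visible objects (from the post)
print(f"best case: x{max_speedup(visible_share):.4f}")
```

Even a perfect coherency scheme for visible objects could therefore make the culling only about 2% faster, which supports the decision not to take on the popping risk for it.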

In previous occlusion cullers I added coherency testing, mainly to save rasterization time, by approximating the current zbuffer from the old zbuffer. for this I had to strictly split static from dynamic objects. I did not want to add this kind of limitation to this lib.




cheers :)







short update:

everyone who requested an sdk should have it by now, if not please contact me, but check your spam folder first :)

Cheers :)

