solenoidz

Depth buffer downloading/downsampling (occlusion cull)


Recommended Posts

Hi guys,

I have a deferred renderer, and one of my render targets is a 32-bit floating-point RT where I store the scene depth. I need to implement an occlusion culling system that reads the scene depth buffer and checks objects' bounding boxes against it to decide whether each object is visible and should be rendered, something like DICE's software occlusion culling in Frostbite.

I want to take my existing depth buffer, which has viewport dimensions (for example 800x600), downsample it to 256x114 pixels, lock that, and read the depth on the CPU side. Because a render target has to be created with D3DPOOL_DEFAULT, and a surface in that pool can't be locked, I'm using g_pd3dDevice->GetRenderTargetData(pSurfRT, pSurfSysMem); to download the data to a lockable system-memory surface. For that call the two surfaces have to have the same dimensions and format. I was planning to use StretchRect to shrink the surface down to 256x114, but the call failed, probably because the docs say StretchRect needs both surfaces to be created with D3DPOOL_DEFAULT (video memory). So that doesn't work.

How can I get a smaller version of my already rendered depth buffer onto the CPU, very fast?

I've also heard of a way to perform the culling entirely on the GPU, by uploading the objects' bounding boxes as texture coordinates and rendering a point list, so that the result is a black-and-white checker texture which encodes the visibility of up to textureWidth * textureHeight objects?

Thank you in advance for any tips concerning this kind of occlusion culling.
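For illustration, the readback path described above could be wired up in D3D9 roughly as sketched below: shrink the full-size depth surface on the GPU with StretchRect into a small default-pool target of the same format, copy that with GetRenderTargetData into a matching system-memory surface, then lock it. This is only a sketch under assumptions: pDepthSurface, pSmallRT and pSmallSysMem are placeholder names, StretchRect support for an R32F source is hardware/driver dependent, and a point-filtered stretch picks arbitrary texels rather than a conservative maximum, which matters for occlusion tests.

// Created once (placeholder names, not from the original code):
IDirect3DSurface9* pSmallRT     = NULL;   // 256x114 R32F, default pool
IDirect3DSurface9* pSmallSysMem = NULL;   // 256x114 R32F, system memory (lockable)
g_pd3dDevice->CreateRenderTarget(256, 114, D3DFMT_R32F, D3DMULTISAMPLE_NONE, 0,
                                 FALSE, &pSmallRT, NULL);
g_pd3dDevice->CreateOffscreenPlainSurface(256, 114, D3DFMT_R32F,
                                          D3DPOOL_SYSTEMMEM, &pSmallSysMem, NULL);

// Per frame, after the depth RT (pDepthSurface) has been rendered:
g_pd3dDevice->StretchRect(pDepthSurface, NULL, pSmallRT, NULL, D3DTEXF_POINT); // GPU-side downsample
g_pd3dDevice->GetRenderTargetData(pSmallRT, pSmallSysMem);                     // copy to sysmem; this is where the stall happens

D3DLOCKED_RECT lr;
if (SUCCEEDED(pSmallSysMem->LockRect(&lr, NULL, D3DLOCK_READONLY)))
{
    const float* pDepth = (const float*)lr.pBits;   // rows are lr.Pitch bytes apart
    // ... test bounding boxes against pDepth here ...
    pSmallSysMem->UnlockRect();
}

As the replies below point out, the GetRenderTargetData copy forces the CPU to wait for the GPU, which is the real cost of this approach.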

It's not really about being very fast, it's about latency. The GPU usually runs quite some time behind the CPU, so when you say you want "a smaller version of my already rendered depth buffer", you probably mean that the CPU is done sending the draw calls; that doesn't mean the GPU is done. If you want to copy the Z-buffer, even as a single 1x1 pixel, you'll have to stall the CPU until the GPU has finished. And once that happens, the GPU will sit idle waiting for new data from the CPU.

You might end up losing more time than you save.


As for purely GPU-based occlusion culling solutions, you'd need to name the API and version, and also say what your limitation is: vertices? draw calls? fragment bound? That hugely impacts which solution is appropriate. For example, if you're draw-call bound, any GPU-side occlusion culling technique might make the situation worse.




Yeah, if you want a GPU render target on the CPU, then you need to wait until several frames after you've issued all rendering commands for that render target if you don't want to cause a CPU/GPU sync point. This obviously severely limits its usefulness for something like occlusion culling, since you'd always be a few frames behind. The Splinter Cell guys got away with it on the 360 because consoles are a much different environment, and are much more suitable for optimizing CPU/GPU synchronization (not to mention the fact that the 360 also has unified memory). The DICE guys actually rasterize their depth buffer on the CPU, specifically so that they can avoid latency issues.
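A hedged sketch of what "a few frames behind" could look like in D3D9: copy the (already downsampled) depth surface into a small ring of system-memory surfaces, fence each copy with a D3DQUERYTYPE_EVENT query, and only lock a slot whose fence has signalled. All names here are placeholders, and note that on many D3D9 drivers GetRenderTargetData itself blocks until the GPU catches up, which is exactly the sync point being warned about, so treat this as a structural sketch rather than a guaranteed win.

// Created once: a ring of readback surfaces plus event queries (placeholder names).
const int NUM_SLOTS = 3;
IDirect3DSurface9* pReadback[NUM_SLOTS];    // 256x114 R32F, D3DPOOL_SYSTEMMEM
IDirect3DQuery9*   pFence[NUM_SLOTS];       // D3DQUERYTYPE_EVENT, one per slot
int slot = 0;
for (int i = 0; i < NUM_SLOTS; ++i)
{
    g_pd3dDevice->CreateOffscreenPlainSurface(256, 114, D3DFMT_R32F,
                                              D3DPOOL_SYSTEMMEM, &pReadback[i], NULL);
    g_pd3dDevice->CreateQuery(D3DQUERYTYPE_EVENT, &pFence[i]);
}

// Per frame: queue this frame's copy, then read only the oldest slot, and only
// once its fence has signalled (skip reads until the ring has wrapped once).
g_pd3dDevice->GetRenderTargetData(pSmallRT, pReadback[slot]);
pFence[slot]->Issue(D3DISSUE_END);
slot = (slot + 1) % NUM_SLOTS;              // the next write slot holds the oldest data

BOOL done = FALSE;
if (pFence[slot]->GetData(&done, sizeof(BOOL), 0) == S_OK && done)
{
    D3DLOCKED_RECT lr;
    if (SUCCEEDED(pReadback[slot]->LockRect(&lr, NULL, D3DLOCK_READONLY)))
    {
        // Cull against depth values that are NUM_SLOTS - 1 frames old.
        pReadback[slot]->UnlockRect();
    }
}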

As the guys say, you can't. Pulling back from the GPU to the CPU is never going to be a fast operation.

From the look of things you're using D3D9, so have you read the documentation on IDirect3DQuery9 (specifically the parts relating to occlusion queries)? You've already got your depth buffer filled in, so the basic (not going to be as fast as possible, but will suffice for now) approach is you create a bunch of IDirect3DQuery9 instances, one for each object you're going to test. Then switch off color and depth writing, but leave depth testing on, and for each object you issue a begin, draw the object's bounding box/sphere/etc, then issue an end. Then you go away and do something else for a while; something that's going to take some time. Run some physics, some sound, whatever. After that's done you run GetData on each query and it will tell you how many pixels passed the depth test; if 0 passed the object isn't visible. (For simpler objects you can just assume that the overhead of running the query plus drawing the bounding box is going to be higher than the overhead of just drawing the object. How much simpler? Depends; you'll need to profile.)
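A minimal sketch of that sequence, assuming D3D9 and an already filled depth buffer; DrawBoundingBox() is a placeholder for whatever submits the occludee's proxy geometry, not an existing API call.

// One query per tested object, created up front.
IDirect3DQuery9* pQuery = NULL;
g_pd3dDevice->CreateQuery(D3DQUERYTYPE_OCCLUSION, &pQuery);

// Disable color and depth writes, keep depth testing on.
g_pd3dDevice->SetRenderState(D3DRS_COLORWRITEENABLE, 0);
g_pd3dDevice->SetRenderState(D3DRS_ZWRITEENABLE, FALSE);
g_pd3dDevice->SetRenderState(D3DRS_ZENABLE, D3DZB_TRUE);

pQuery->Issue(D3DISSUE_BEGIN);
DrawBoundingBox(object);                    // cheap proxy geometry for the occludee
pQuery->Issue(D3DISSUE_END);

// ... run physics, sound, anything else useful here ...

DWORD visiblePixels = 0;
while (pQuery->GetData(&visiblePixels, sizeof(DWORD), D3DGETDATA_FLUSH) == S_FALSE)
{
    // Result not ready yet; spinning here stalls the CPU, which is what the
    // "use last frame's result" variant described below avoids.
}
bool visible = (visiblePixels > 0);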

This will still need to stall the pipeline, but it should have much less overhead than pulling back the depth buffer, locking it, extracting the data, resizing it and testing objects in software. If you're vsynced or already bottlenecking on the CPU you may not even notice it...

A more performant method involves testing to see if the result is ready yet, and if not then just using the previous frame's result (and obviously you can't issue a new query for that object until the old one has completed either), on the assumption that in any typical scene things really don't change too much from frame to frame. That allows the GPU to keep ahead of the CPU, but is a mite more fiddly to set up.
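A short sketch of that polling pattern, assuming some per-object bookkeeping (the visible and queryPending fields are hypothetical, not part of any D3D interface):

// Poll without flushing or waiting; fall back to last frame's verdict.
DWORD visiblePixels = 0;
HRESULT hr = obj.pQuery->GetData(&visiblePixels, sizeof(DWORD), 0);
if (hr == S_OK)
{
    obj.visible      = (visiblePixels > 0);   // fresh result this frame
    obj.queryPending = false;                 // safe to issue a new Begin/End
}
// else S_FALSE: the result is still in flight, so keep using obj.visible from
// the previous frame and don't issue another query for this object yet.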

Thanks. Yes, I use Direct3D 9.0.
So DICE's Frostbite uses a software rasterizer, and that's faster than downloading and locking a 256x114 texture from the GPU?
That said, I had an idea to skip the resizing of the depth buffer by using MRT and just rendering to another, smaller render target dedicated to the occlusion stuff, but as you said, the problem is that the target would only be ready for the CPU after several frames, and that's unacceptable because it could introduce many artifacts.

@mhagain
Currently I do exactly what you described, using an IDirect3DQuery9 for every object and fetching the result the next frame, but this is what people say about that approach:
link

Also, I'm not really happy with it, because I still need to submit every object every frame, issuing draw calls for either the object's bounding box or the object's actual geometry. I believe that with a depth buffer available on the CPU side, I could simply check objects' AABBs and not even try to render them if they're not visible.

So Frostbite uses a software rasterizer... hmm. Can you guys point me to a source where I can learn how to properly do that kind of depth rendering on the CPU?
I have a basic idea of how to scanline a triangle and interpolate depth, but a demo/article/source code would be very useful.
Thanks.

Hi,
If you're not afraid of perhaps not-so-learning-friendly code, my MIT-licensed 3D engine contains CPU occlusion culling code (triangle mesh depth rasterization to an integer buffer, generation of a depth mipmap pyramid, and testing of occludee AABBs against the depth buffer):

http://code.google.c...usionBuffer.cpp
http://code.google.c...clusionBuffer.h

The triangle rasterization is based on Chris Hecker's software rendering articles.
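To make the last step concrete, here is a hedged sketch of the simplest form of the AABB-versus-depth-buffer test (with no mip pyramid), assuming the caller has already projected the occludee box's eight corners to screen space and that depthBuf holds the CPU-rasterized occluder depths, with smaller values meaning nearer. None of this is code from the linked engine; the depth mipmap pyramid it builds exists so that the inner loop below can test a handful of coarse texels instead of every pixel.

#include <algorithm>

struct ScreenPoint { float x, y, z; };   // z = post-projection depth in [0, 1]

// Returns true only if every depth-buffer texel under the box's screen
// rectangle is nearer than the box's nearest corner, i.e. the box is fully
// hidden. A box that crosses the near plane should be treated as visible
// before ever reaching this test.
bool IsOccluded(const ScreenPoint corners[8],
                const float* depthBuf, int width, int height)
{
    float minX = corners[0].x, maxX = corners[0].x;
    float minY = corners[0].y, maxY = corners[0].y;
    float minZ = corners[0].z;                       // nearest point of the box
    for (int i = 1; i < 8; ++i)
    {
        minX = std::min(minX, corners[i].x);  maxX = std::max(maxX, corners[i].x);
        minY = std::min(minY, corners[i].y);  maxY = std::max(maxY, corners[i].y);
        minZ = std::min(minZ, corners[i].z);
    }

    int x0 = std::max(0, (int)minX), x1 = std::min(width  - 1, (int)maxX);
    int y0 = std::max(0, (int)minY), y1 = std::min(height - 1, (int)maxY);

    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (depthBuf[y * width + x] >= minZ)
                return false;                        // an occluder is farther here: box may show
    return true;
}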

While I don't offer source, I have a solution that also works quite decently: http://www.gamedev.net/page/community/iotd/index.html/_/one-billion-polys-r85

If you're interested, drop me a PM :)


Thanks. I started implementing a software renderer, and I also have an idea for my current hardware occlusion culling implementation to go one level up and start occluding the bounding boxes of whole octree nodes with nearby objects, in order to reduce draw calls.
This is my simple triangle rasterizer. It takes three screen-space points and rasterizes the triangle in two passes. Depth interpolation shouldn't be a problem, but I need to implement clipping to make it useful in practice.

// Assumes Point2D stores float coordinates; if it stores ints, accumulate the
// edge x positions in separate floats instead. points[0..2] are the triangle's
// screen-space vertices and horizline() fills one row of the buffer.
Point2D A, B, C;
int indexPointC = ExtreamePoint(points[0].y, points[1].y, points[2].y, true);   // vertex with the largest y
C = points[indexPointC];
int indexPointA = ExtreamePoint(points[0].y, points[1].y, points[2].y, false);  // vertex with the smallest y
A = points[indexPointA];
B = points[3 - (indexPointC + indexPointA)];                                    // the remaining (middle) vertex

Point2D XL, XR;

// Edge slopes (change in x per one-scanline step in y). Guard each division so
// a horizontal or degenerate edge doesn't divide by zero.
float fdeltaR   = (C.y != A.y) ? (C.x - A.x) / (C.y - A.y) : 0.0f;   // long edge A->C
float fdeltaL   = (B.y != A.y) ? (B.x - A.x) / (B.y - A.y) : 0.0f;   // upper short edge A->B
float fdeltaLup = (C.y != B.y) ? (C.x - B.x) / (C.y - B.y) : 0.0f;   // lower short edge B->C

// Upper half: walk down from A until we reach B's row.
XL = XR = A;
while (XL.y <= B.y)
{
    XL.x += fdeltaL;
    XR.x += fdeltaR;

    horizline(XL.x, XR.x, XL.y);
    XL.y++;
}

// Lower half: walk up from C until we reach B's row again.
XL = XR = C;
while (XR.y >= B.y)
{
    XL.x -= fdeltaLup;
    XR.x -= fdeltaR;

    horizline(XL.x, XR.x, XR.y);
    XR.y--;
}
