Just been listening to this talk, and I'm not sure if I understand the technique correctly. This is what I got from it:
- Render the scene objects to a depth buffer - which I believe are the AABBs of all the scene objects
- Detach this depth buffer and use it as a texture resource
- Create a vertex buffer of all the scene objects' AABBs
- Render this stream, converting to NDC space in the vertex shader and testing against the depth buffer, outputting the result from the vertex shader
- Read this stream back on the CPU end and render objects properly based on the occlusion results in the vertex buffer
Rendering with Conviction: The Graphics of Splinter Cell Conviction
This is what I got out of it:
For each frame:
* render occluder geometry (this is different than bounding boxes! it represents the minimum extents of occluding geometry) to a depth buffer
* build a hierarchy of downsized depth buffers using the MAX depth of a 4-pixel square at each iteration
For each object:
* using object AABB, calculate the AABB's minimum (closest) depth in screen space
* find out how large the AABB is in screen space, use that to determine which level of the hierarchy to test
* test the hierarchy to figure out if the object's minimum depth is larger than the hierarchy level's depth, and wait for the results before rendering that frame
I don't remember how they said they implemented the hierarchy testing part.
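The per-frame steps above can be sketched on the CPU for clarity. This is only an illustration, not the talk's implementation (on the GPU it would be a chain of downsampling passes); the name `build_hzb` and the square power-of-two buffer assumption are mine:

```python
def build_hzb(depth):
    """depth: square, power-of-two 2D list of depth values (mip 0).
    Returns the full mip chain; each level stores the MAX (farthest)
    depth of the corresponding 2x2 block one level below."""
    chain = [depth]
    while len(depth) > 1:
        half = len(depth) // 2
        depth = [
            [max(depth[2 * y][2 * x],     depth[2 * y][2 * x + 1],
                 depth[2 * y + 1][2 * x], depth[2 * y + 1][2 * x + 1])
             for x in range(half)]
            for y in range(half)
        ]
        chain.append(depth)
    return chain
```

Taking MAX (rather than averaging) is what makes the coarser levels conservative: a coarse texel holds the farthest occluder depth in its footprint, so anything behind it is behind everything there.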
Yeah, it's rather annoying. They don't seem to mention where the test is done. I believe it's on the GPU, but then that would require sending a large buffer (20,000 queries in their case) to and from the GPU.
The Real-Time Rendering site has an article based on this talk, and it mentions the implementation.
http://www.realtimerendering.com/blog/update-on-splinter-cell-conviction-rendering/
Yeah, the occlusion geometry is most likely just polygons authored by the artists (IIRC they mention problems when these are out of sync with the visual polygons).
Quote: Original post by hick18
Yeah, it's rather annoying. They don't seem to mention where the test is done. I believe it's on the GPU, but then that would require sending a large buffer (20,000 queries in their case) to and from the GPU.
Well, they've got at least 3 implementations of the renderer (DX9-360, DX9-PC and Mac-GL), so it could even be different for each one...
For example, on the 360, the GPU acts as the motherboard's "northbridge" - the CPU's gateway to the system's RAM - so accessing this hierarchical depth buffer on the CPU would be pretty cheap.
On PC, they might copy the data to a lockable texture, then repeatedly try to access it with the DONOTWAIT flag to avoid stalling the CPU/GPU during the transfer... or they might use occlusion queries in some way.
[edit] I didn't see Daark's link, which says it's all GPU!
They might use predicated draw calls on some platforms, though I remember him saying that they chose to wait for all occlusion results to be available, whereas predication is non-stalling (not guaranteed to cull if the result isn't available yet).
Well, I'll wait for the blog post that he says he may be writing. But I think the talk is very incomplete, failing to mention the critical implementation details that make the system feasible at a decent framerate. I thought the whole point of these talks was to resolve some of these sorts of questions that other developers might have.
I don't expect them to give away all the details, as they'll probably want to keep some of their tricks to themselves.
Quote: Original post by Hodgman
On PC, they might copy the data to a lockable texture, then repeatedly try to access it with the DONOTWAIT flag to avoid stalling the CPU/GPU during the transfer
When you say this, do you mean use the results from the last frame instead, or do other work and try again (in the same frame)?
Yeah I meant the same frame. So you'd do your depth/occlusion pass, then request to read the data immediately, which will likely not be available yet (but will initiate the transfer). So you then use the CPU to go do some more game logic for that frame (or any 'latent' logic, like path-finding), and then when that's done you ask for the data from the GPU again (perhaps stalling this time if you have to). Then using the occlusion info, you do the main render.
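That request-early / poll / stall-late pattern might look like this in outline. Everything here is a stand-in of my own: `GpuReadback` simulates a D3D9 lockable surface, `try_lock` stands for a `Lock` with `D3DLOCK_DONOTWAIT` (which fails with `D3DERR_WASSTILLDRAWING` while the transfer is in flight), and the poll countdown fakes GPU latency:

```python
class GpuReadback:
    """Toy stand-in for an asynchronous GPU->CPU transfer that only
    becomes available after a few polls (a fixed countdown here)."""
    def __init__(self, data, polls_until_ready=3):
        self.data = data
        self.polls_remaining = polls_until_ready

    def try_lock(self):
        # Stands in for Lock(..., D3DLOCK_DONOTWAIT): returns None
        # while the transfer hasn't finished yet.
        if self.polls_remaining > 0:
            self.polls_remaining -= 1
            return None
        return self.data

    def lock_blocking(self):
        # Stands in for a plain Lock(): stalls until the data is ready.
        self.polls_remaining = 0
        return self.data

def fetch_occlusion_results(readback, do_game_logic):
    """Request the data immediately, interleave useful CPU work with
    polling, and only stall once there's nothing left to do."""
    results = readback.try_lock()           # initiates the transfer
    while results is None and do_game_logic():
        results = readback.try_lock()       # poll again after useful work
    if results is None:
        results = readback.lock_blocking()  # out of work: accept the stall
    return results
```

The point of the shape is that the stall at the end only happens if the game-logic work ran out before the transfer completed.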
Actually, I just noticed this particular part of the talk:
Render Occluders -> Build Z Hierarchy -> Test Object Bounds -> Read Back
Besides the read-back, all of this happens on the GPU.
To me, this implies the following:
* They do a Z-only render of their artist-created occlusion models (not everything would have one of these models - mostly just the environment).
* They then build a full mip-chain from this Z-buffer, taking the maximum depth value when down-sizing (instead of the usual bilinear filtering).
* Then they render "the occlusion testing" (my guesses below) to a large texture, and read this texture back to the CPU to determine which objects to draw.
The magic is obviously in the "Test Object Bounds" part. My guess is that each object in the scene is rendered as a single point during this step (the PDF mentions this is a "single batch (one big POINTLIST VB of bounds)").
If your result render-target is 1024x1024px, that lets you test 1M unique points in one draw-call! They mention doing "over 20,000", so they could use a 256x128 render-target (32,768 pixels) and have room to spare.
Each point in this big vertex buffer has a unique 2D position (i.e. each object is associated with a unique output pixel) and also contains the AABB/OBB data (packed into the vertex attributes, instead of normals/tex-coords/etc).
In the vertex shader, you take the AABB/OBB and project it into screen space. Depending on the area covered, you choose an appropriate mip-level of the depth buffer. You then sample every depth-texel covered by that screen-area and compare the sampled depth values against the min-depth of the AABB/OBB.
If any of the depth tests pass, you output white to the fragment shader, otherwise you output black. The fragment shader just passes through this colour.
After rendering your big vertex-buffer of points, you've got a 2D array of true/false (white/black) values (one for each occludable object) saying what has passed the hierarchical depth test. They then read this back to the CPU (possibly like I described above) and use it to skip unnecessary draw-commands.
[edit]
Wow... now that I've written that down, it doesn't sound so complicated any more, and only requires a card with VTF (older ATI cards implement R2VB instead of VTF, so you'd need two versions of the shaders if you want it to work on older cards).
[Edited by - Hodgman on July 21, 2010 8:35:47 AM]
Yeah, that's how I understood it as well. But I wasn't going to bother with a pixel shader; I was just going to create a stream-out pass, pack each AABB into a vertex, and process the buffer. The two big time-intensive steps in the whole process I can see are:
a) Locking and filling a buffer of AABBs on the CPU to test
b) Locking and retrieving the buffer of results on the CPU
The fact that he mentions it's all done on the GPU made me think I had the above wrong.
I don't see the need to create a mip-map chain of the Z-buffer though. Is it to help with the texture cache? If so, I think the savings would be lost in the creation of the chain.
I think the point of the hierarchical mip-map chain is so that they only have to do a maximum of four texture samples for each test. If the object's AABB is big enough to take up a large amount of screen space, they use a low-res-enough image from the hierarchy that they only need to sample the 4 nearest points to conservatively cover the entire screen area the object could possibly take up. If you think about it for a while it makes sense.
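The level choice described here can be written down in a couple of lines (a sketch with an assumed name, `choose_mip_level`; the exact rounding rule in the real shader may differ):

```python
import math

def choose_mip_level(rect_w_px, rect_h_px):
    """Given the screen-space extent of the projected bounds in mip-0
    pixels, return the mip level whose texels are at least that big,
    so the rect can touch at most 2 texels per axis (<= 4 samples)."""
    longest = max(rect_w_px, rect_h_px, 1)
    return max(0, math.ceil(math.log2(longest)))
```

A texel has to be at least as wide as the rect (not half as wide), because an unaligned span of width w can straddle three texels of width w/2 but at most two texels of width w.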
Yeah as venzon says, without the HZB, the gather cost of sampling every covered ZB value would be too high.
Quote: Original post by Hodgman
now that I've written that down, it doesn't sound so complicated any more, and only requires a card with VTF
Now that I've got some sleep: to remove the VTF requirement, you could simply move the hierarchical-depth-testing part to the fragment shader, and then this works on pretty much any shader-supporting hardware! Even a GeForce3 could implement this :D
Quote: Original post by hick18
But the 2 big time-intensive steps in the whole process I can see are:
a) Locking and filling a buffer of AABBs on the CPU to test
This is "bad", but probably not that bad. For example, each vertex/AABB could be 2 float3s (world-space position and extents). Multiply by 20K and that's ~500KB - large, but probably doable assuming you're not uploading lots of other stuff to the card each frame (PCIe v1.0 - from 2004 - can do 4MB per frame @ 60Hz). To take it further, you could store your extents as 8-bit, in inches, limiting you to a max occludee width of 255" (~6.5m), but reducing that upload size to ~300KB.
If lots of the occludees are static (which could be the case), you could split them into a separate VB which isn't updated, which could dramatically reduce this upload size.
I'm not sure if the diagram in the presentation shows an OBB or an AABB - this would be a trade-off between accuracy plus more CPU time vs a bigger upload size, I guess.
Quote:
b) Locking and retrieving the buffer of results on the CPU
Seeing as the results are true/false, you could use a shader to compress them before downloading them to the CPU.
e.g. say you rendered out white/black pixels to an RGBA render target.
First you compress the width by 8x, by sampling groups of 8 texels and packing them into one colour.
Then you compress the height by 4x, by sampling groups of 4 texels (each of which contains 8 boolean values) and storing them in the R, G, B and A channels.
This reduces your 256x128 RGBA render-target to 32x32, or 4KB, which should be pretty easy to transfer.
These steps can of course be adjusted depending on the bit-depth and number of channels in the render-target format.
[edit] Fixed "occluders" to "occludees" in some places
[Edited by - Hodgman on July 21, 2010 11:02:20 PM]
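The two compression passes can be mimicked in plain Python to check the arithmetic (function names here are made up; the real version would be two fullscreen shader passes writing to smaller render-targets):

```python
def compress(results):
    """results: rows of 0/1 values; width divisible by 8, height by 4."""
    h, w = len(results), len(results[0])
    # Pass 1: 8x width reduction - pack 8 booleans into one byte per "pixel".
    bytes_grid = [
        [sum(results[y][8 * x + b] << b for b in range(8))
         for x in range(w // 8)]
        for y in range(h)
    ]
    # Pass 2: 4x height reduction - store 4 consecutive rows in the
    # R,G,B,A channels of a single output pixel.
    return [
        [tuple(bytes_grid[4 * y + c][x] for c in range(4))
         for x in range(len(bytes_grid[0]))]
        for y in range(h // 4)
    ]

def visible(packed, obj_x, obj_y):
    """Look up the original (obj_x, obj_y) result in the packed grid."""
    rgba = packed[obj_y // 4][obj_x // 8]
    return ((rgba[obj_y % 4] >> (obj_x % 8)) & 1) == 1
```

With a 256x128 source this yields a 32x32 four-channel target, i.e. 32 x 32 x 4 = 4096 bytes, matching the 4KB figure above.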
This topic is closed to new replies.