Jump to content
  • Advertisement
Sign in to follow this  
spek

Light-culling in a Compute-Shader

This topic is 2266 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Just rolled in Compute-Shaders (OpenCL) and trying to cull lights for a "Tiled Deferred Rendering" approach. The question is, how to properly cull? Basically I want to check:
1- Cull sky
2- Z/Depth. Prevent tiles bother lamps that are in their fore- or background
3- For tiles that have a (nearly) equal normal, check if the light can affect it (dot product check). Skip back-lighting
4- Check if a sphere/cone (spotlight) overlaps the tile

Check 1 (sky)
Simple, just skip tiles that have a depth far-far away.


Check 2 (Z-test)
is doable too. Or at least... for spheres. I split the screen in small 32x32 pixel tiles, and downscale the (eye)Z-Buffer to a same resolution. This downscaled version contains the Minimum and Maximum depth (camera distance) for each tile. So, a pointlight could be tested like this:
camLampDist = length( camera.xyz - lamp.xyz );
inRange = ( camLampDist + lampRadius > tileMinZ ) && ( camLampDist - lampRadius < tileMaxZ );

It gets nastier for cone lights though. My math is crap, but I guess you'll have to calculate a minimum and maximum Z of the spotlight by using it's projection matrix or something.


Check 3 (normals)
I also downscale the G-Buffer that contains the normals, and make an average normal per tile. I also check if all normals are nearly the same inside a tile. If not, the average normal is useless to test with. But for flat surfaces, we can check if a spotlight can possibly affect it:
canAffectNormal = dot( spotLight.shineDirection.xyz, -tileNormal.xyz ) >= 0 - margin;

For point lights, it gets more difficult as that one shines in all directions. If we would know a position, we could calculate the vector between the tile and the light center, but obviously we can't store a single average position per tile, as the tile contents differ for each pixel.


Check 4 (volume overlap)
The most important one. Again, AFAIK, we can't make use of position-data so checking if a tile is within distance would be difficult / impossible. I guess it's possible to transform the tiles and light positions to 2D-screen coordinates, and then check if a tile overlaps a pointlight circle in 2D space. But erh... that requires the ModelViewProjection matrix right? I also wonder how to compute a 2D circle for lights that have their centers behind the camera.

Anyway, how would you do a similiar test for a cone shape then? It should be possible somehow, the Battlefield 3 slides are showing how they cull quite nicely. But how they do it... beats me.



Maybe my whole idea of culling is wrong, so please, enlight me :)
Rick

Share this post


Link to post
Share on other sites
Advertisement
I was mucking about with this a while back, and I went down the road of getting min and max xyz values per tile in view space (using the atomic operations), and generating a simple tile bounding volume from that. (frustum, sphere and abb I tried). The choice of volume will impact performance differently depending on how expensive your shading is vs how many lights you have vs the resulting culling pass time.
Then I just treat both my spots and my point lights atm as spheres, quick intersection test (one light test per thread in group), followed by using that list for the actual per pixel shading (one thread per pixel at that point).

I'm sure there are much better / cleverer ways of doing it, especially for spotlights. Tbh if you passed in as constants or even generated from the spot params a bounding frustum, it would be pretty quick to cull against the light frustum.

Share this post


Link to post
Share on other sites
Hey thanks!

>> getting min/max xyz using atomic operations
Maybe its useful to know what that exactly means. each paper talks about it but slap me in the face :) AFAIK, an atomic operation is a single operation that can't be broken (interrupted by another thread). But ifso, I still don't see how you could get these values in a single call.In order to get the min/max Z, I did the following:
- Make a target texture with a 45 x 29 pixel resolution (which is my screen / 32x32 pixel tiles)
- For each targetpixel,loop through 32x32 pixels in the original depth texture, and store the min/max value. Thus that means 1024 tex2D operations per tile.

Quite a lot work!

I see how making a bounding box can help detecting sphere collisions. Combined with other checks it should cull pretty well. But for cones... Making a sphere out of a cone produces a lot of unused volume...

Share this post


Link to post
Share on other sites
Hi spek!

Have you seen the Tile-based Deferred Shading source code from Andrew Lauritzen? It might help you a little.


>> getting min/max xyz using atomic operations
Maybe its useful to know what that exactly means. each paper talks about it but slap me in the face smile.png

The min/max z value can be obtained by launching 32x32 threads per tile (considering each tile as a thread group). Thus, for each pixel one thread is responsible. Then, each thread reads the pixel’s depth and interprets this float value as an uint (asuint intrinsic).
With the atomics InterlockedMin and InterlockedMax on a shared memory variable you can find the minimum and maximum depth value for each tile in parallel. (The atomics don’t work on floats, thus the nasty trick with the reinterpretation of the bits…) Afterwards you just sync the threads (GroupMemoryBarrierWithGroupSync). And now you can reinterpret the values again as a float (asfloat).
You also get the jist, if you look at the "DirectX 11 Rendering in Battlefield 3" slides from DICE, which can be found here.


AFAIK, an atomic operation is a single operation that can't be broken (interrupted by another thread).

Correct!


I see how making a bounding box can help detecting sphere collisions. Combined with other checks it should cull pretty well. But for cones... Making a sphere out of a cone produces a lot of unused volume...

Maybe you find something useful here.
What about approximating the lights by frustums, too and then do frustum-to-frustum tests?

Best regards

Share this post


Link to post
Share on other sites
Didn't see the source code, got something nice to study tonight, thanks!
I'll be studying links this info. And likely I'll be back with questions later tonight/tomorrow ;)

Share this post


Link to post
Share on other sites
Ok, I had a look into that source, though it was a bit much to get a quick understanding. At least for me, my head has limited bandwidth :)

It required re-reading the example "ComputeShaderTile.hlsl" file several times with your explanation next to it, but it makes sense now:
1- Declare a min/max variable shared amonst the entire group, a 32x32 pix tile in mu case
2- The compute shader program will be executed for each pixel separate.
3- Each thread reads the depth, or actually mutliple samples due MSAA
4- Wait until all group threads are done
5- Do an atomic min/max with the shared group "min/max" variables and our own depth value we read in step 3

Now it's just a matter of finding the OpenCL variants on the Interlock / GroupBarrier functions and declaring shared group variables.



Next step, the culling. What I don't completely understand in the example... the Light Culling happens in the same program, right after the Z values have been calculated (and synced). Doesn't this mean ALL screen pixels are actually computing whether they are affected by light-X or not? Ifso, I can use the pixel-world position to perform pretty accurate culling techniques. But then again, I guess when doing relative complex calculations for each pixel, I'm missing the whole performance-gain.

Maybe I'm not reading the code well... As said, it always takes time before I get the clue :P

Share this post


Link to post
Share on other sites
Hi Spek,


Next step, the culling. What I don't completely understand in the example... the Light Culling happens in the same program, right after the Z values have been calculated (and synced). Doesn't this mean ALL screen pixels are actually computing whether they are affected by light-X or not?

The beauty of the tile-based lighting is that the culling of the lights is parallelized (I get to that in a minute) and that each pixel afterwards only iterates over the “contributing” (non-culled) lights.
Even better, we only accumulate the colors in local memory (in standard deferred lighting it is global memory, with rasterizer and blending overhead) and the G-buffer is only accessed once (in standard deferred lighting it is once per light), which saves a lot bandwidth. This is where the huge benefit comes from.

Let’s see. We already covered that for each pixel, there is one thread. A tile is made of 32x32=1024 threads. Now the idea is to take advantage of the fact that most of the pixels in a tile will be affected by the same lights. So, instead of culling the lights against the pixels, we just cull them against the whole tile (which is why we computed the min/max depth of the tile). Sure, this test is a little coarse, but the coherence between the pixels is usually quite high.
So let’s say we have for example 4096 lights. We also have 1024 threads per tile (32x32). This means, for culling those 4096 lights we have 1024 threads available. Thus, each thread only needs to cull 4 lights and we got them all. smile.png All we need to do is to iterate interleaved over the light array and cull the hand full of lights (note that in the loop we have lightIndex += COMPUTE_SHADER_TILE_GROUP_SIZE).
The first thread would cull light number 0, 1024, 2048, ...
The second thread then 1, 1025, 2049, ... and so on. smile.png
Every light that does intersect with the frustum is added to a list (in shared memory). And eventually, after synchronizing, every threads gets back to its original responsibility: the pixel it was launched for and iterates over the light list stored in shared memory (which is available to the whole tile) to compute and accumulate the actual lighting.

I hope this makes it a little clearer. smile.png
Cheers!

Share this post


Link to post
Share on other sites
And thanks again! This sure clears up the way how workgroups and such work. Too bad though it seems my laptop doesn't support atomic operations (2009 hardware). OpenCL smacks me with an "Unsupported operation" message when trying atom_add / atom_inc or the likes. Is there a work-around, other than downscaling the Z-buffer with an ordinary fragment shader like I did so far?

The example indeed loops on a certain "interval", rather than checking all lights for each thread. I won't be having that much lights though. Let's say I have 10 lights only, I could only let the first 10 threads check 1 light each, then let the other 1014 threads do nothing.

I think I get a bit now, how the workload is divided on the groups and their threads. Got to find something on the atomic operations, and figure out the culling math now.



@Frenetic Pony
Crap, each time I try to learn one technique, two new ones pop up smile.png I'll first try to master Tiled Deferred Lighting before proceeding hehe, but thanks for the paper nonetheless!


edit
-----
Another sub, OpenCL-related question. Let's say my screen is 640 x 320 pixels, and I want to divide it in 20x10 tiles, each using 32x32 threads. I thought you would run it like this then

clEnqueueNDRangeKernel( clCommandQueue, kernel, 2, //dimensions
null, // offset
pointerTo[20,10], // global worksize
pointerTo[32,32], // local worksize
0, null, null);

But this doesn't work, for one reason because the local-worksize can't be bigger than the global worksize... Probably I'm confusing things.
- Found it out. The global worksize shouldn't be the number of workgroups, but the total amount of the work/items. And for that reason, the worksize must be a number smaller AND dividable through the worksize. Thus in this case, [640,320] should be the global worksize.
Also make sure the local worksize isn't too big. My computer could only use 16x16 groups...

Greets Edited by spek

Share this post


Link to post
Share on other sites
@spek

No worries! The paper is based on tiled anyway, and is only really beneficial if you have thousands of lights. If you've only got a few hundred, then you should stick with tiled. Edited by Frenetic Pony

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

We are the game development community.

Whether you are an indie, hobbyist, AAA developer, or just trying to learn, GameDev.net is the place for you to learn, share, and connect with the games industry. Learn more About Us or sign up!

Sign me up!