
Light-culling in a Compute-Shader


Just rolled into Compute Shaders (OpenCL) and I'm trying to cull lights for a "Tiled Deferred Rendering" approach. The question is: how to cull properly? Basically I want to check:
1- Cull the sky
2- Z/depth: prevent tiles from bothering with lamps that lie entirely in front of or behind their depth range
3- For tiles whose normals are (nearly) equal, check whether the light can affect them at all (dot-product check), to skip back-lighting
4- Check whether a sphere (pointlight) or cone (spotlight) overlaps the tile

Check 1 (sky)
Simple: just skip tiles whose depth is at (or very near) the far plane.
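In pseudo-code, something like this (just a sketch; it assumes the downscaled buffer stores linear eye depth and that zFar is the far-plane distance, or whatever "sky" marker you write yourself):

// Skip the whole tile if even its nearest pixel is (almost) at the far plane: pure sky.
if (tileMinZ >= zFar * 0.999f)
    return;   // nothing in this tile to light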


Check 2 (Z-test)
This is doable too, at least for spheres. I split the screen into small 32x32-pixel tiles and downscale the (eye-space) Z-buffer to the same resolution. This downscaled version contains the minimum and maximum depth (camera distance) for each tile. A pointlight can then be tested like this:
camLampDist = length( camera.xyz - lamp.xyz );   // distance from camera to light center
inRange = ( camLampDist + lampRadius > tileMinZ ) && ( camLampDist - lampRadius < tileMaxZ );

It gets nastier for cone lights though. My math is crap, but I guess you'd have to calculate a minimum and maximum Z for the spotlight by using its projection matrix or something.
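My best guess so far (untested, and it assumes the tile min/max are stored as view-space Z rather than camera distance): the view-space Z extent of the cone can be taken from its apex and its base circle, without touching the projection matrix:

// Spotlight as a cone: apex 'pos', unit direction 'dir', length 'range',
// half-angle 'halfAngle', all already transformed to view space.
float  baseRadius    = range * tan( halfAngle );
float3 baseCenter    = pos + dir * range;
// Extent of the base circle along the view Z axis.
float  circleExtentZ = baseRadius * sqrt( fmax( 0.0f, 1.0f - dir.z * dir.z ) );

float coneMinZ = fmin( pos.z, baseCenter.z - circleExtentZ );
float coneMaxZ = fmax( pos.z, baseCenter.z + circleExtentZ );

// Same overlap test as for the sphere:
bool inRange = ( coneMaxZ > tileMinZ ) && ( coneMinZ < tileMaxZ );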


Check 3 (normals)
I also downscale the G-buffer that contains the normals and compute an average normal per tile, plus a flag that tells whether all normals inside the tile are nearly the same. If they aren't, the average normal is useless to test with. But for flat surfaces we can check whether a spotlight can possibly affect the tile:
canAffectNormal = dot( spotLight.shineDirection.xyz, -tileNormal.xyz ) >= 0 - margin;

For pointlights it gets more difficult, since they shine in all directions. If we knew a position we could compute the vector between the tile and the light center, but obviously we can't store a single average position per tile, as the tile contents differ per pixel.


Check 4 (volume overlap)
The most important one. Again, AFAIK we can't make use of per-pixel position data, so checking whether a tile is within range would be difficult or impossible. I guess it's possible to transform the tile and light positions to 2D screen coordinates and then check whether a tile overlaps a pointlight's circle in 2D space. But erh... that requires the ModelViewProjection matrix, right? I also wonder how to compute a 2D circle for lights whose centers are behind the camera.

Anyway, how would you do a similar test for a cone shape? It should be possible somehow; the Battlefield 3 slides show how they cull quite nicely, but how they do it... beats me.



Maybe my whole idea of culling is wrong, so please, enlighten me :)
Rick

I was mucking about with this a while back, and I went down the road of getting min and max XYZ values per tile in view space (using the atomic operations) and generating a simple tile bounding volume from that (I tried a frustum, a sphere and an AABB). The choice of volume will impact performance differently depending on how expensive your shading is vs. how many lights you have vs. the resulting culling-pass time.
Then I just treat both my spots and my pointlights as spheres for now: a quick intersection test (one light test per thread in the group), followed by using that list for the actual per-pixel shading (one thread per pixel at that point).

I'm sure there are much better / cleverer ways of doing it, especially for spotlights. To be honest, if you passed in a bounding frustum as constants, or even generated it from the spot parameters, it would be pretty quick to cull against the light frustum.
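For what it's worth, the AABB flavour of that per-thread test is only a few lines, assuming the per-tile min/max XYZ are already in view space (just a sketch, names invented):

// Sphere vs. view-space tile AABB: clamp the light position to the box,
// then compare the squared distance against the light radius.
float3 closest     = clamp( lightPosVS, tileMin, tileMax );
float3 d           = lightPosVS - closest;
bool   affectsTile = dot( d, d ) <= lightRadius * lightRadius;

The sphere-vs-frustum version is similar: test the sphere centre against the four side planes plus the tile's min/max Z.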

Hey thanks!

>> getting min/max xyz using atomic operations
Maybe it's useful to explain what that exactly means; every paper mentions it, but never spells it out :) AFAIK, an atomic operation is a single operation that can't be broken (interrupted by another thread). But if so, I still don't see how you could get these values in a single call. To get the min/max Z, I did the following:
- Make a target texture with a 45 x 29 pixel resolution (my screen divided into 32x32-pixel tiles)
- For each target pixel, loop through the 32x32 pixels in the original depth texture and store the min/max value. That means 1024 tex2D operations per tile.

Quite a lot of work!

I can see how making a bounding box helps with detecting sphere collisions. Combined with the other checks it should cull pretty well. But for cones... wrapping a sphere around a cone produces a lot of unused volume...

Hi spek!

Have you seen the Tile-based Deferred Shading source code from Andrew Lauritzen? It might help you a little.


>> getting min/max xyz using atomic operations
Maybe it's useful to explain what that exactly means; every paper mentions it, but never spells it out :)

The min/max Z value can be obtained by launching 32x32 threads per tile (treating each tile as a thread group), so that one thread is responsible for each pixel. Each thread then reads its pixel's depth and reinterprets that float value as a uint (asuint intrinsic).
With the atomics InterlockedMin and InterlockedMax on a shared-memory variable you can find the minimum and maximum depth value for each tile in parallel. (The atomics don't work on floats, hence the nasty trick with reinterpreting the bits...) Afterwards you just sync the threads (GroupMemoryBarrierWithGroupSync), and then you can reinterpret the values as floats again (asfloat).
You'll also get the gist if you look at the "DirectX 11 Rendering in Battlefield 3" slides from DICE, which can be found here.
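Since you are working in OpenCL rather than HLSL, the same pattern would translate to roughly the following. This is only an untested sketch: it assumes your device exposes the local 32-bit atomics, that the depth texture holds linear (non-negative) eye-space depth, and that one work-group maps to one tile; all names are made up.

__kernel void tile_depth_bounds( __read_only image2d_t depthTex,
                                 __global float2* tileBounds )
{
    const sampler_t samp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

    __local uint sMinZ;   // shared per work-group (= per tile)
    __local uint sMaxZ;

    int2  pix   = (int2)( get_global_id(0), get_global_id(1) );
    float depth = read_imagef( depthTex, samp, pix ).x;   // linear eye-space depth

    bool isFirstThread = (get_local_id(0) == 0 && get_local_id(1) == 0);
    if (isFirstThread) { sMinZ = 0xFFFFFFFFu; sMaxZ = 0u; }
    barrier( CLK_LOCAL_MEM_FENCE );

    // Depth is never negative, so its bit pattern compares correctly as a uint.
    atomic_min( &sMinZ, as_uint(depth) );
    atomic_max( &sMaxZ, as_uint(depth) );
    barrier( CLK_LOCAL_MEM_FENCE );

    // One thread per tile writes the result.
    if (isFirstThread)
    {
        uint tile = get_group_id(1) * get_num_groups(0) + get_group_id(0);
        tileBounds[tile] = (float2)( as_float(sMinZ), as_float(sMaxZ) );
    }
}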


AFAIK, an atomic operation is a single operation that can't be broken (interrupted by another thread).

Correct!


I can see how making a bounding box helps with detecting sphere collisions. Combined with the other checks it should cull pretty well. But for cones... wrapping a sphere around a cone produces a lot of unused volume...

Maybe you'll find something useful here.
What about approximating the lights by frustums, too, and then doing frustum-to-frustum tests?
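If you stick with spheres for the moment: the bounding sphere of a cone doesn't have to be centred on the apex with radius = range. The minimal bounding sphere has a closed form; a sketch, where apex/dir/range/halfAngle are the usual spotlight parameters:

// Tight bounding sphere of a cone (apex 'pos', unit 'dir', length 'range', half-angle 'halfAngle').
float3 sphereCenter;
float  sphereRadius;
if (halfAngle < M_PI_4_F)
{
    // Narrow cone: the sphere passes through the apex and the rim of the base circle.
    float cosA = cos( halfAngle );
    float t    = range / (2.0f * cosA * cosA);
    sphereCenter = pos + dir * t;
    sphereRadius = t;
}
else
{
    // Wide cone: the sphere around the base circle already contains the apex.
    sphereCenter = pos + dir * range;
    sphereRadius = range * tan( halfAngle );
}

At a 45-degree half-angle both branches give the same sphere, so the boundary is harmless.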

Best regards

Didn't see the source code yet; that gives me something nice to study tonight, thanks!
I'll be studying the links and this info, and likely I'll be back with questions later tonight/tomorrow ;)

Ok, I had a look into that source, though it was a bit much to grasp quickly. At least for me; my head has limited bandwidth :)

It required re-reading the example "ComputeShaderTile.hlsl" file several times with your explanation next to it, but it makes sense now:
1- Declare min/max variables shared amongst the entire group (a 32x32-pixel tile in my case)
2- The compute shader program is executed for each pixel separately
3- Each thread reads the depth (or actually multiple samples, due to MSAA)
4- Each thread does an atomic min/max on the shared group "min/max" variables with the depth value it read in step 3
5- Wait until all group threads are done; the shared variables now hold the tile's depth bounds

Now it's just a matter of finding the OpenCL equivalents of the Interlocked / GroupBarrier functions and of declaring shared group variables.



Next step: the culling. What I don't completely understand in the example... the light culling happens in the same program, right after the Z values have been calculated (and synced). Doesn't this mean ALL screen pixels are actually computing whether they are affected by light X or not? If so, I could use the pixel world-position to perform pretty accurate culling. But then again, I guess that by doing relatively complex calculations for each pixel, I'd be throwing away the whole performance gain.

Maybe I'm not reading the code well... As said, it always takes a while before I get the clue :P

Hi Spek,


Next step: the culling. What I don't completely understand in the example... the light culling happens in the same program, right after the Z values have been calculated (and synced). Doesn't this mean ALL screen pixels are actually computing whether they are affected by light X or not?

The beauty of tile-based lighting is that the culling of the lights is parallelized (I'll get to that in a minute) and that each pixel afterwards only iterates over the "contributing" (non-culled) lights.
Even better, we only accumulate the colors in local memory (in standard deferred lighting it goes through global memory, with rasterizer and blending overhead), and the G-buffer is only accessed once (in standard deferred lighting it is read once per light), which saves a lot of bandwidth. This is where the huge benefit comes from.

Let's see. We already covered that there is one thread per pixel, and that a tile is made of 32x32 = 1024 threads. Now the idea is to take advantage of the fact that most of the pixels in a tile will be affected by the same lights. So instead of culling the lights against the pixels, we just cull them against the whole tile (which is why we computed the min/max depth of the tile). Sure, this test is a little coarse, but the coherence between the pixels is usually quite high.
Let's say we have, for example, 4096 lights. We also have 1024 threads per tile (32x32). This means that for culling those 4096 lights we have 1024 threads available, so each thread only needs to cull 4 lights and we've got them all. :) All we need to do is iterate over the light array in an interleaved fashion and cull that handful of lights (note that in the loop we have lightIndex += COMPUTE_SHADER_TILE_GROUP_SIZE).
The first thread culls lights 0, 1024, 2048, ...
The second thread culls lights 1, 1025, 2049, ... and so on. :)
Every light that intersects the tile's frustum is added to a list in shared memory. Eventually, after synchronizing, every thread gets back to its original responsibility (the pixel it was launched for) and iterates over the light list stored in shared memory (which is visible to the whole tile) to compute and accumulate the actual lighting.
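If it helps to see the shape of it, a stripped-down OpenCL sketch of that pattern could look like the code below. It is untested; the light struct, MAX_LIGHTS_PER_TILE and the intersection test are placeholders for whatever you end up using, and the list append relies on an atomic_inc on shared memory.

#define MAX_LIGHTS_PER_TILE 64

typedef struct { float x, y, z, radius; } PointLight;   // placeholder layout

// Placeholder test: view-space Z range only; plug in your real tile/volume test here.
bool lightIntersectsTile( PointLight l, float tileMinZ, float tileMaxZ )
{
    return (l.z + l.radius > tileMinZ) && (l.z - l.radius < tileMaxZ);
}

__kernel void cull_and_shade( __global const PointLight* lights, uint numLights,
                              float tileMinZ, float tileMaxZ /* per-tile in reality */ )
{
    __local uint sTileLightIndices[MAX_LIGHTS_PER_TILE];
    __local uint sTileLightCount;

    uint threadIdx = get_local_id(1) * get_local_size(0) + get_local_id(0);
    uint groupSize = get_local_size(0) * get_local_size(1);

    if (threadIdx == 0)
        sTileLightCount = 0;
    barrier( CLK_LOCAL_MEM_FENCE );

    // Interleaved culling: thread 0 tests lights 0, groupSize, 2*groupSize, ...
    for (uint i = threadIdx; i < numLights; i += groupSize)
    {
        if (lightIntersectsTile( lights[i], tileMinZ, tileMaxZ ))
        {
            uint slot = atomic_inc( &sTileLightCount );   // append to the shared list
            if (slot < MAX_LIGHTS_PER_TILE)
                sTileLightIndices[slot] = i;
        }
    }
    barrier( CLK_LOCAL_MEM_FENCE );

    // From here on, every thread goes back to "its" pixel and accumulates lighting
    // using only sTileLightIndices[0 .. min(sTileLightCount, MAX_LIGHTS_PER_TILE)).
}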

I hope this makes it a little clearer. :)
Cheers!

And thanks again! This sure clears up how workgroups and such work. Too bad it seems my laptop doesn't support atomic operations (2009 hardware); OpenCL smacks me with an "Unsupported operation" message when I try atom_add / atom_inc or the like. Is there a work-around, other than downscaling the Z-buffer with an ordinary fragment shader like I did so far?

The example indeed loops with a certain stride, rather than checking all lights in each thread. I won't have that many lights though. Say I have only 10 lights: I could just let the first 10 threads check 1 light each and let the other 1014 threads do nothing.

I think I now get how the workload is divided over the groups and their threads. Got to find something on the atomic operations and figure out the culling math now.



@Frenetic Pony
Crap, each time I try to learn one technique, two new ones pop up :) I'll first try to master Tiled Deferred Lighting before proceeding, hehe, but thanks for the paper nonetheless!


edit
-----
Another OpenCL-related sub-question. Let's say my screen is 640 x 320 pixels, and I want to divide it into 20x10 tiles, each using 32x32 threads. I thought you would run it like this:

clEnqueueNDRangeKernel( clCommandQueue, kernel,
                        2,                 // dimensions
                        null,              // offset
                        pointerTo[20,10],  // global worksize
                        pointerTo[32,32],  // local worksize
                        0, null, null );

But this doesn't work, for one because the local work size can't be bigger than the global work size... Probably I was confusing things.
- Found it out: the global work size isn't the number of workgroups, but the total amount of work-items. So in this case [640, 320] should be the global work size, and the local work size has to be smaller than, and evenly divide, the global work size.
Also make sure the local work size isn't too big; my computer could only use 16x16 groups...
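For reference, the working call then looks roughly like this (host-side, C flavour; error handling omitted):

size_t globalSize[2] = { 640, 320 };   // total work-items: one per pixel
size_t localSize[2]  = { 16, 16 };     // work-group (tile) size; must divide globalSize evenly

cl_int err = clEnqueueNDRangeKernel( clCommandQueue, kernel,
                                     2,            // work dimensions
                                     NULL,         // global offset
                                     globalSize,   // global work size
                                     localSize,    // local work size
                                     0, NULL, NULL );

With a 16x16 local size the tiles become 16x16 pixels, so the downscale target for a 640 x 320 screen would be 40 x 20.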

Greets

@spek

No worries! The paper is based on tiled rendering anyway, and is only really beneficial if you have thousands of lights. If you've only got a few hundred, then you should stick with tiled.

There won't be hundreds of lights, unless we do something with VPLs for ambient in the future, but I doubt that.

I managed to write a compute shader that downscales, culls, and lights the pixels all in a single program. Although it doesn't really cull yet; I'm saving the hardest parts for later :P OpenCL isn't giving me an easy time btw: strange compiler bugs (on comment lines!) several times, and "normalize()" caused compiler errors as well until I downloaded the latest nVidia drivers. Seems it's still a bit buggy.


It compiles now, yet one nasty problem remains: I can't perform atomic operations (probably my videocard is a bit too old for that). Unless someone has smart work-arounds, I wouldn't know how to synchronize workgroup variables (like getting the min/max Z per group, or adding lamps to a shared array).

Nonetheless, all the tips here have been very helpful for getting a better understanding of compute shaders!

Hi!


I can't perform atomic operations (probably my videocard is a bit too old for that). Unless someone has smart work-arounds, I wouldn't know how to synchronize workgroup variables (like getting the min/max Z per group, or adding lamps to a shared array).


I think you have DX10-class hardware, don't you? Unfortunately, there is no way to use atomics on this class of hardware. :-(

Instead you could use a parallel reduction to find the minimum and maximum depth. For a tile of n x n pixels this takes only O(log n) reduction steps instead of the O(n²) serial reads of the brute-force approach (iterating over all pixels). The idea is to iteratively compare the values in a tree-like manner. Either you launch the compute shader with a fixed number of threads and do the loop in the shader (causing more and more threads to idle with each iteration), or you launch a new compute shader for each iteration with the actual required number of threads. Here is a short example to give you the idea:
2   9   3   8
 \ /     \ /
  9       8        // launch 2 threads (max of each pair)
   \     /
    \   /
      9            // launch 1 thread
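In OpenCL terms, the in-shader variant of that reduction only needs local memory and barrier(), which are core features (unlike the atomics). A rough, untested fragment, assuming a work-group of TILE_SIZE x TILE_SIZE threads (one per pixel) and that each thread has already read 'depth' for its pixel:

#define TILE_SIZE    16
#define TILE_THREADS (TILE_SIZE * TILE_SIZE)

// Inside the kernel body:
__local float sMinZ[TILE_THREADS];
__local float sMaxZ[TILE_THREADS];

uint tid = get_local_id(1) * TILE_SIZE + get_local_id(0);
sMinZ[tid] = depth;
sMaxZ[tid] = depth;
barrier( CLK_LOCAL_MEM_FENCE );

// Tree reduction: halve the number of active threads each step.
for (uint stride = TILE_THREADS / 2; stride > 0; stride >>= 1)
{
    if (tid < stride)
    {
        sMinZ[tid] = fmin( sMinZ[tid], sMinZ[tid + stride] );
        sMaxZ[tid] = fmax( sMaxZ[tid], sMaxZ[tid + stride] );
    }
    barrier( CLK_LOCAL_MEM_FENCE );
}
// After the loop, sMinZ[0] / sMaxZ[0] hold the tile's depth bounds.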


As for the construction of the light list, I'd probably use a geometry shader for that. For each tile, draw the lights as vertices (using the light buffer as vertex buffer). Then you could do the culling in the geometry shader and stream a list of visible lights via stream output to another buffer. In a compute shader launched for this tile you can then read the G-buffer data once and iterate over the light list for each pixel. At least this saves you the G-buffer bandwidth problem (at the cost of introducing bandwidth problems somewhere else), since you only read from the G-buffer once (the same would be true if you just rendered a viewport quad and computed the lighting for every light without culling any of them).

Though, since you do the whole process for each tile individually, you lose much of the benefit. :(
Well, at least the geometry shader would have to be run per tile individually.

You could speed up the process a lot if you had syncs available, in order to make proper use of shared memory (parallel reduction, reading of the light list...).

Unfortunately, no better way has come to my mind yet. I seriously doubt this is going to be faster than a standard deferred renderer, especially if you only have a few lights... So I guess tile-based on DX10 hardware isn't a good idea.

Though, the rendering architects at DICE (Christina Coffin among others) developed a tile-based approach for the PS3. Maybe you can find some inspiration there?

DX10 hardware indeed. Thanks for sharing ideas, but as you say, I'm afraid it gives quite a lot of overhead, not to mention the programming effort. The best solution, though a bit risky, is to drive over the laptop with a truck and hope the boss pays for a new one with DX11 hardware. Normally my laptops die after 1.5 years or so, but this one seems indestructible :|

In the meantime I'll take another look at the PS3 BF3 sheets, but so far I don't see any notes about syncing in this paper:
http://www.slideshare.net/DICEStudio/spubased-deferred-shading-in-battlefield-3-for-playstation-3
Then again, I'm not familiar with SPUs or any other PS3 implementation details either.

Wait... I'll dust off my secret weapon... the desktop PC hidden somewhere in a corner of the house, which has DX11 hardware.

Thanks!
