Raytracing specific 3D Light Culling Access structure

Started by
2 comments, last by ownthewin 9 months ago

hi,

in my "raytracing only" game engine i want to implement a many light culling access structure (LCAS)

given:
- structured buffer of all scene lights ( point, spot, direct, area) ( scene light list)
- every light has
o defined max_distance
o is tagged as dynamic or static

- support for up to 1000 lights

needed:
- global kind of "list" of light cluster indices (world cluster light list)
- e.g. whole game world is partitioned in a grid 100 * 100 * 5

- every cell holds
o start index
o count

of the lights, that can possibly affect any given shading point in the world

Only the distance of the world cluster to the shading point is relevant.
No world normal or light direction checking

per level preprocessing:
- create LCAS ( static lights)

per frame processing:
- update LCAS ( dynamic lights )

per pixel processing:
- loop 2 times ( static, dynamic )
- find cluster by shading world coordinate
- get light entry start index
- loop over the light candidates and shade

How is this done best ?

To me it seams best to have 2 world LCAS ( static and dynamic lights separated)
because of the update time that is needed for the dynamic lights

suppose ( 100 x 100 x 5 x 16 light entries) fixed grid

problem:

Most easy would be to have a really big arrays but

most of the array would be empty and the max count of entries here 16 follows the cluster with maximum lights in it

Should i better use a 2 level hierarchy and how can it be fast to update ?

Advertisement

evelyn4you said:
To me it seams best to have 2 world LCAS ( static and dynamic lights separated)

I'd rather assume the opposite. Building acceleration structures for just 1000 lights should be very cheap, and you'll have many more hitpoints to shade than lights. So it should pay off to merge static and dynamic lights each frame.

evelyn4you said:
suppose ( 100 x 100 x 5 x 16 light entries) fixed grid

You can get away with just 100 x 100 x 5 for the spatial grid.
Each cell would have only an index to the first light, e.g. 105.
To get the end of the list of lights you take the index of the next cell, e.g. 108.
Then you know the lights in current cell are 105, 106, 107.

To make this work, both the grid cells and the binned lights must be ordered continuously in memory, which is usually the case anyway because of the way we do the binning / sorting.
E.g. for a 3x3 grid the cells are in memory like so:

0,1,2,
3,4,5,
6,7,8, plus eventually one extra cell to determine the count of lights in the last cell

We also have 5 lights for example, and i'll label them with letters and bin to grid cells like so:

-,	A,	-,
DE,	B, -,
C, -, -

The spatially sorted list of lights so becomes:
A,D,E,B,C
replacing letters with the now known memory indices we obviously get:
0,1,2,3,4

Now showing the indices we place in our grid, which is basically a prefix sum:

0,0,1,
1,3,4,
4,5,5, 5

If you're not familiar yet with using prefix sums for binning, it's maybe a bit confusing.
But basically we advance a counter over the cells in order, increasing it by the number of lights in each cell.
This can be done very efficiently with parallel programming on GPU, and prefix sums or other such ‘scan’ algorithms are the most important building block for parallel programming at all, i would say.
Really essential to build spatial acceleration structures. (Again recommending CS chapter from OpenGL Super Bible book, explaining those things very well.)

But now let's test our stuff. E.g. i want to iterate the two lights of cell (0,1):

int cellIndex = 1 * 3 + 0;
int listBegin = prefixSumBuffer[cellIndex];
int listEnd = prefixSumBuffer[cellIndex+1];

for (int i=listBegin; i<listEnd; i++)
{
	int lightIndex = KnownMemoryIndicesOfLights[i];
	// this will give us 1 and 2, so D and E
}

As you see, we use minimal memory and we get ideal access patterns. So that's the goal.

But to get there, i need to explain how to build the grid, so the binning process itself.
First, for each light we increase a counter for the cell it occupies. (Ideally you can do this with atomicAdd on a grid fitting into LDS.) We would get:

0,1,0,
2,1,0,
1,0,0, 0

But i prefer to increase the counter of the next cell instead, giving:

0,0,1,
0,2,1,
0,1,0, 0

(Both works, but the latter requires to add one more ‘cell’ at the back as shown, while the former would require to add one more cell to the front to hold a zero, practically requiring a branch to deal with the first cell special case.
This is also related to the detail of ‘inclusive vs. exclusive prefix sum’.)

Then we do the prefix sum. With sequential code that's simply:

for (int i=1; i<=3*3; i++)
	prefixSumBuffer[i] += prefixSumBuffer[i-1];  

With parallel code that's not too complicated either, see the book or google for GPU prefix sum. We get:

0,0,1,
1,3,4,
4,5,5, 5

As shown initially above.

After that we iterate the lights a second time, and we can use the known list start points to reorder the lights in memory so they form a compact, continuous, and sorted list of lights.

Take some time to go through this. It's trivial but still some stuff.
And that's a really good example to go from ‘brute force pixel and vertex shaders’ to real parallel programming using compute shaders.

evelyn4you said:
Should i better use a 2 level hierarchy and how can it be fast to update ?

Good question. I would rephrase it to ‘Should we place one light into just one cell, or do we need to put one light into multiple cells?’

The example i gave is for the former and simpler case. It's good also because we know memory requirements in advance.
To deal with large lights, we can use a multi level grid, using larger cells for larger lights. (We always pick the level so a light fits into a square formed by 2x2 cells)
To find all lights potentially affecting a point, we need to iterate the 2x2 closest cells for each level.
It works, but we need to iterate multiple levels, which will cause execution divergence on GPU.
So it's likely worth to use just one level, but put a light into all cells it covers. However, that's more complicated mainly because memory requirements are unbound. So you need some max cap, which level designers need to know and respect to prevent failure cases.

If this is all new to you, i would start with the simple multi level grid. Because no problem solving gets in your way while learning the parallel programming stuff.

hello JoeJ,

thank you very much for your very informative post.

I was so busy with other problems, that i did not spend on this topy and forgot to give you a positive feedback. ( please forgive me )
It helped me a lot to understand and how to keep the size of the “reserved vram small”, because my intention is to put one light in multiple cells.

I will study the references you gave because your proposed method will lead to a “two pass” => spread solution.

I normally design all my effects as “gather methods”. This would mean here to “loop over all cells” vs "looping over all lights"

When i have time to code i will give more feedback.




This topic is closed to new replies.

Advertisement