olaolsson

Member Since 14 Mar 2012
Offline Last Active Sep 13 2014 02:38 PM

#5116761 Difference Tiled and Clustered shading

Posted by olaolsson on 13 December 2013 - 02:59 PM

One problem is storing the light indices, since the array is not always the same size and allocating for the theoretical maximum makes the memory size pretty much impossible.

So a worst case must be used, but if we say we allow 1024 point lights and 1024 spot lights, is 4*1024*1024 a good pool size?

 

The way we implemented it is to allocate a 'reasonable' buffer and then to grow it when (if) needed.
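To make the 'allocate a reasonable buffer and grow when needed' idea concrete, here is a minimal CPU-side sketch. It is not the implementation from the talk: the LightIndexPool type, its members and the doubling policy are all assumptions made for illustration, and the std::vector just stands in for whatever GPU buffer would actually be used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pool of light indices shared by all tiles/clusters.
// Starts at a 'reasonable' capacity and doubles whenever a frame needs more.
struct LightIndexPool {
    std::vector<uint32_t> indices;  // backing storage (stand-in for a GPU buffer)
    std::size_t used = 0;           // number of slots handed out this frame

    explicit LightIndexPool(std::size_t initialCapacity) : indices(initialCapacity) {
        assert(initialCapacity > 0);
    }

    // Reserve 'count' consecutive slots and return the offset of the first one.
    std::size_t allocate(std::size_t count) {
        const std::size_t needed = used + count;
        if (needed > indices.size()) {
            std::size_t newCapacity = indices.size();
            while (newCapacity < needed) newCapacity *= 2;  // assumed growth policy
            indices.resize(newCapacity);                    // existing contents are kept
        }
        const std::size_t offset = used;
        used = needed;
        return offset;
    }

    // Call at the start of each frame; the capacity is kept, so growth stays rare.
    void reset() { used = 0; }
};
```

With a real GPU buffer the resize would mean reallocating (and possibly redoing the frame's list build), so growth should happen rarely once the pool has settled at a size that fits the scene.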

I think Emil covered how they deal with this in our talk from Siggraph this year:

'Practical Clustered Deferred and Forward Shading'

 

This talk should provide a few insights into both the general gist of the algorithm and the practical implementation at Avalanche.

 

Hope it helps.

.ola




#5023603 Is Clustered Forward Shading worth implementing?

Posted by olaolsson on 20 January 2013 - 02:14 PM

Note that Forward+ (aka Clustered Forward, Light Indexed Deferred) is a very new topic and there's a lot of research coming up this year.

Now, just because I'd hate for this to turn into another deferred lighting / shading terminology kerfuffle:

 

Tiled Forward <=> Forward+: these use 2D tiling (same as Tiled Deferred), with an optional pre-z pass plus a separate geometry pass for shading.

Light Indexed Deferred: builds the lists per pixel, which can be viewed as a 1x1 tile, and is then really the same as Tiled Forward. The practical difference is pretty big, though...

Clustered Forward: performs tiling in 3D (or higher), otherwise as above.

Tiled/Clustered Deferred Shading: do tiling as their forward counterparts, but start with a G-Buffer pass and end with a deferred shading pass.
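As a rough illustration of 2D versus 3D tiling, the sketch below computes a tile key from the pixel position alone and a cluster key from the pixel position plus a depth slice. The GridParams struct, the function names and the exponentially spaced depth slices are assumptions chosen for the example, not something stated in this post.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical grid description; none of these values come from the post.
struct GridParams {
    uint32_t tileSize;  // tile width and height in pixels
    uint32_t tilesX;    // tiles per row
    uint32_t tilesY;    // tile rows
    uint32_t slices;    // depth slices (clustered only)
    float zNear;        // positive view-space distances
    float zFar;
};

// Tiled (2D): the key depends on the pixel position only.
inline uint32_t tileIndex(const GridParams& g, uint32_t px, uint32_t py) {
    return (py / g.tileSize) * g.tilesX + (px / g.tileSize);
}

// Assumed depth subdivision: slices spaced exponentially between zNear and zFar.
inline uint32_t depthSlice(const GridParams& g, float viewZ) {
    assert(g.zNear > 0.0f && g.zFar > g.zNear && g.slices > 0);
    if (viewZ <= g.zNear) return 0;
    if (viewZ >= g.zFar) return g.slices - 1;
    const float t = std::log(viewZ / g.zNear) / std::log(g.zFar / g.zNear);
    const uint32_t s = static_cast<uint32_t>(t * static_cast<float>(g.slices));
    return s < g.slices ? s : g.slices - 1;  // guard against rounding at the far end
}

// Clustered (3D): the same 2D tile, extended with the depth slice.
inline uint32_t clusterIndex(const GridParams& g, uint32_t px, uint32_t py, float viewZ) {
    return depthSlice(g, viewZ) * g.tilesX * g.tilesY + tileIndex(g, px, py);
}
```

In the forward variants a key like this would be computed per fragment in the shading pass; in the deferred variants it would be computed per G-Buffer sample in the deferred shading pass.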

 

Hope this clears up, and/or prevents, some confusion.




#4927391 Revival of Forward Rendering?

Posted by olaolsson on 02 April 2012 - 01:55 AM

...If you didn't do this you could build a list of lights just using the near/far planes of the camera, but I would suspect that the larger light lists + lack of good early z cull would cause performance to go right down the drain.


I did look at that in my paper 'Tiled Shading' that someone posted a link to above. And the short answer is that no indeed, it does not end too well.

On the other hand, I imagine that it would be a useful technique simply to manage lights in an environment with not too many lights in any given location and limited views (e.g. RTS camera or so), as the limited depth span makes the depth range optimization less effective.

I've got an OpenGL demo too, which builds grids entirely on the CPU (so it's not very high performance, just there to demo the techniques).

Btw, one thing I noticed that may affect your results is that you use atomics to reduce the min/max depth. Shared memory atomics on NVIDIA hardware serialize on conflicts, so using them to perform a reduction this way is less efficient than just using a single thread in the CTA to do the work (at least then you don't have to run the conflict detection steps involved). This step gets a lot faster with a SIMD parallel reduction, which is fairly straightforward. I don't have time to dig out a good link, sorry, so I'll just post a CUDA variant I've got handy. It is for 32 threads (a warp), but scales up with appropriate barrier syncs. sdata is a pointer to a 32-element shared memory buffer (is that local memory in compute shader lingo? Anyway, the on-chip variety).

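#include <cstdint>

// Sums 'data' over the 32 threads of a warp; 'index' is the lane (0..31).
// For the min/max depth case, use min()/max() instead of the addition.
__device__ uint32_t warpReduce(uint32_t data, uint32_t index, volatile uint32_t *sdata)
{
  unsigned int tid = index;
  sdata[tid] = data;
  __syncwarp(); // all 32 values must be written before any lane reads them
  // Halve the active range each step: lanes [0, offset) add in [offset, 2 * offset).
  for (unsigned int offset = 16; offset > 0; offset >>= 1)
  {
    if (tid < offset)
      sdata[tid] += sdata[tid + offset];
    __syncwarp(); // required on GPUs with independent thread scheduling (CUDA 9+)
  }
  return sdata[0];
}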

Same goes for the list building, where a prefix sum could be used. Here it'd depend on the rate of collisions. Anyway, I'm thinking this might be a difference between NVIDIA and AMD (where I don't have a clue how atomics are implemented).
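To show what a prefix sum buys for list building, here is a serial CPU sketch: count the lights per tile, turn the counts into offsets with an exclusive prefix sum, then scatter the light indices into one packed array. The LightGrid struct, buildLightGrid and the (tile, light) pair input are assumptions for illustration; on the GPU the scan would be a parallel one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct LightGrid {
    std::vector<uint32_t> offsets;  // tile t owns indices[offsets[t] .. offsets[t + 1])
    std::vector<uint32_t> indices;  // light indices, grouped by tile
};

// 'pairs' holds hypothetical (tile, light) overlaps produced by the culling step.
LightGrid buildLightGrid(const std::vector<std::pair<uint32_t, uint32_t>>& pairs,
                         uint32_t numTiles) {
    LightGrid grid;
    grid.offsets.assign(numTiles + 1, 0);

    // Pass 1: count lights per tile, stored one slot to the right.
    for (const auto& p : pairs) {
        assert(p.first < numTiles);
        ++grid.offsets[p.first + 1];
    }

    // Pass 2: running sum over the shifted counts, which gives each tile's
    // start offset (an exclusive prefix sum of the counts).
    for (uint32_t t = 0; t < numTiles; ++t)
        grid.offsets[t + 1] += grid.offsets[t];

    // Pass 3: scatter each light index into its tile's range.
    std::vector<uint32_t> cursor(grid.offsets.begin(), grid.offsets.end() - 1);
    grid.indices.resize(pairs.size());
    for (const auto& p : pairs)
        grid.indices[cursor[p.first]++] = p.second;

    return grid;
}
```

Each tile's write position is known up front, so no tile has to bump a shared counter atomically while the lists are filled.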

As a side note, it's much more efficient to work out the screen space bounds of each light before running the per tile checks, saves constructing identical planes for tens of tiles, etc.
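A minimal sketch of that ordering, assuming the light's screen-space bounding rectangle has already been computed (the projection itself is left out): the rectangle is converted to a tile range once, and only those tiles are touched. ScreenRect, binLight and the per-tile vector layout are hypothetical names for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Inclusive pixel bounds of a light on screen (assumed to be computed elsewhere).
struct ScreenRect { int minX, minY, maxX, maxY; };

// Adds 'lightIndex' to every tile touched by the rectangle, without running
// any per-tile plane tests.
void binLight(const ScreenRect& r, uint32_t lightIndex, int tileSize,
              int tilesX, int tilesY,
              std::vector<std::vector<uint32_t>>& tileLights) {
    assert(tileSize > 0 && tilesX > 0 && tilesY > 0);
    assert(tileLights.size() == static_cast<std::size_t>(tilesX) * static_cast<std::size_t>(tilesY));

    const int width = tilesX * tileSize;
    const int height = tilesY * tileSize;
    // Entirely off screen: nothing to insert.
    if (r.maxX < 0 || r.maxY < 0 || r.minX >= width || r.minY >= height) return;

    // Clamp to the screen, then convert pixel bounds to an inclusive tile range.
    const int tx0 = std::max(r.minX, 0) / tileSize;
    const int ty0 = std::max(r.minY, 0) / tileSize;
    const int tx1 = std::min(r.maxX, width - 1) / tileSize;
    const int ty1 = std::min(r.maxY, height - 1) / tileSize;

    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx)
            tileLights[static_cast<std::size_t>(ty * tilesX + tx)].push_back(lightIndex);
}
```

The rectangle is conservative, so some tiles will list lights that do not actually reach them. A finer per-tile test could still be run afterwards, but only on the tiles inside the range.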

Anyway, fun to see some activity on this topic! And I'm surprised at the good results for tiled forward.

Cheers
.ola

