Cascaded shadow maps: texture atlas or texture array?


Started by Funkymunky November 16, 2013 11:32 PM

10 comments, last by ADDMX 10 years, 4 months ago

1,420

Author

November 16, 2013 11:32 PM

In researching cascaded shadow maps, I see a lot of advice about putting them into an atlas. If they're passing in 4 MVPs anyway, I assume that they're using a geometry shader to emit 4 copies of the vertices. So why not just have the geometry shader specify the layer of a texture array to render the data into? What is the upside of using an atlas?

MJP

20,295

November 16, 2013 11:39 PM

I don't know of anyone using GS amplification to render geometry to multiple cascades simultaneously. While you can certainly do it, in my experience it's always been slower than just rendering your geometry multiple times.

There aren't really many upsides to using an atlas over an array; it was mainly used on hardware that predates texture arrays. Using an array simplifies things a lot and is just as fast.
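To make the "simplifies things" part concrete, here is a minimal sketch of the lookup coordinates in each case, written as plain C++ rather than shader code. The 2x2 tile layout, the tile ordering, and the names `atlasUV` and `arrayUVW` are assumptions for illustration only, not anything from a particular engine.

```cpp
#include <cassert>

struct Float2 { float x, y; };
struct Float3 { float x, y, z; };

// Atlas (assumed 2x2 layout): remap a [0,1] shadow-map UV into the tile
// that belongs to the given cascade (0..3).
Float2 atlasUV(Float2 uv, int cascade) {
    const float tile = 0.5f;                 // each tile covers half the atlas per axis
    const float offsetX = (cascade % 2) * tile;
    const float offsetY = (cascade / 2) * tile;
    return { offsetX + uv.x * tile, offsetY + uv.y * tile };
}

// Array: the UV is used unchanged and the cascade index selects the slice.
Float3 arrayUVW(Float2 uv, int cascade) {
    return { uv.x, uv.y, static_cast<float>(cascade) };
}
```

With the atlas, a filter kernel near a tile edge can also pick up texels from the neighbouring cascade unless the coordinates are clamped to the tile; with the array each slice is addressed on its own.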

The Blog | The Book

Funkymunky

1,420

Author

November 17, 2013 12:11 AM

Huh, that's surprising. I guess I'll implement both and profile it, but googling around about geometry shaders it sounds like that's the prevailing opinion, even for current cards. Thanks!

MJP

20,295

November 17, 2013 02:03 AM

The GS is in general a slow path for the GPU. It doesn't map very well to the hardware, and any kind of amplification scenario results in lots of overhead and traffic to off-chip memory. It can definitely save you CPU overhead since you can potentially reduce draw calls by quite a bit, but very likely this will be at the expense of GPU performance.


kauna

2,925

November 17, 2013 08:57 PM

The geometry shader can be used to define the destination texture array slice of a triangle, so it has its use.

So you can use geometry instancing when rendering the geometry and for different instances you can specify the destination slice. This way you'll save multiple draw calls at least per object. Otherwise the geometry shader only passes through the triangle data so no geometry amplification is done.
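As a rough sketch of the bookkeeping (plain C++, not shader code): one possible mapping packs the cascade into the low part of the instance ID. The cascade count of 4, the packing scheme, and the names `decodeInstance` and `expandedInstanceCount` are assumptions for illustration. In D3D the pass-through geometry shader would then write the slice to `SV_RenderTargetArrayIndex`.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kCascadeCount = 4;   // assumed number of cascades

struct DecodedInstance {
    std::uint32_t objectInstance;  // index into the per-object instance data
    std::uint32_t slice;           // destination texture array slice
};

// Each object instance appears kCascadeCount times in a single instanced
// draw; the remainder selects the slice, the quotient the original instance.
constexpr DecodedInstance decodeInstance(std::uint32_t instanceId) {
    return { instanceId / kCascadeCount, instanceId % kCascadeCount };
}

// Instance count to pass to the draw call for a given number of objects.
constexpr std::uint32_t expandedInstanceCount(std::uint32_t objectInstances) {
    return objectInstances * kCascadeCount;
}
```

For example, instance ID 6 would decode to object instance 1 going to slice 2.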

Cheers!

FreneticPonE

3,311

November 17, 2013 10:32 PM

If you do some profiling, by all means share the results!

LorenzoGatti

4,648

November 18, 2013 11:23 AM

There aren't really many upsides to using an atlas over an array; it was mainly used on hardware that predates texture arrays. Using an array simplifies things a lot and is just as fast.

Most literature about cascaded shadow maps predates availability of texture arrays; that's why there's a lot of "advice about putting them into an atlas" and very little about using texture arrays.

However, the choice between two ways to reduce texture switching is only a low level detail, which affects performance (maybe) but not shadow quality.

Omae Wa Mou Shindeiru

ADDMX

281

November 21, 2013 08:38 AM

I've already done some experiments in my own engine, and here are the results. Everything was tested on a complex scene with a large number of small/medium-scale vegetation objects (most of them instanced), about 3 million vertices casting shadows (30% of them skinned), on an NVIDIA 680M and an i7:

1] Draw everything once, use the GS to replicate vertices, and render into a 2048x2048 texture with four 1024x1024 quarters - SLOOOW - you need 2 custom clip planes to clip to the quarter of the atlas - the GS is the main bottleneck (performance of the whole frame around FPS = 21.1)

2] Draw everything once, use the GS to replicate vertices, and render into a 1024x1024x4 texture array - SLOOW - but better than the previous one since no clip planes are needed - the GS is the main bottleneck (FPS = 22.4)

3] Draw everything once into a 2048x2048 texture with four 1024x1024 quarters - this time multiply the instance count of every draw call by 4, and in the VERTEX shader use (InstanceIndex & 0x3) to output into a specific quarter of the atlas (again with 2 custom clip planes) - FAST, FPS = 42.7 !!! (twice as fast as the GS path) - this time the bottleneck is the vertex shader, because of all those skinned vertices.

4] Use a texture array, but submit a separate set of draw calls for each cascade. Here there _IS_ an opportunity to clip them independently, so the win is the total number of vertices processed by the VS (for 1 and 2 the total was 3M, for 3 it was 12M!, for 4 it was 6M), but the loss is the total number of batches (for 1, 2, and 3 it was 972, for 4 it was 1944).

FPS = 42.1 if everything is submitted to the base context; 44.1 if 4 deferred contexts are used, each created on a different scheduler task, and all of them are then submitted at once to the base context.

For 3 there is probably a chance to outperform 4 if some neat way of clipping is introduced, but for now I have no time for this and I'm sticking with it as-is, since I need to keep the batch count as low as possible because other parts of the engine demand them.
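Variant 3 can be sketched roughly like this, as plain C++ standing in for the vertex shader math. It assumes the vertex has already been transformed by the light MVP of the cascade selected with (InstanceIndex & 0x3); the quarter-to-cascade layout, the sign conventions, and the names `AtlasVertex` and `toAtlasQuarter` are illustrative assumptions, not the exact code from the engine.

```cpp
#include <cassert>
#include <cstdint>

struct Float4 { float x, y, z, w; };

struct AtlasVertex {
    Float4 position;        // clip-space position squeezed into one atlas quarter
    float clipDistance[2];  // >= 0 keeps the vertex; one plane per inner tile edge
};

// 'clip' is the cascade's clip-space position. The low two bits of the
// instance index pick the quarter: bit 0 selects the x half, bit 1 the y half.
AtlasVertex toAtlasQuarter(Float4 clip, std::uint32_t instanceIndex) {
    const std::uint32_t quarter = instanceIndex & 0x3u;
    const float sx = (quarter & 1u) ? 1.0f : -1.0f;
    const float sy = (quarter & 2u) ? 1.0f : -1.0f;

    AtlasVertex out;
    // Scale NDC by 0.5 and shift by +/-0.5; done in clip space, so the
    // shift is multiplied by w.
    out.position = { 0.5f * clip.x + 0.5f * sx * clip.w,
                     0.5f * clip.y + 0.5f * sy * clip.w,
                     clip.z, clip.w };
    // The outer tile edges coincide with the atlas border and are clipped
    // by the rasterizer anyway; only the two inner edges need custom planes.
    out.clipDistance[0] = clip.w + sx * clip.x;
    out.clipDistance[1] = clip.w + sy * clip.y;
    return out;
}
```

Without the two clip distances, geometry outside a cascade's frustum would spill into the neighbouring quarter, which is why the array variants can skip them.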

FreneticPonE

3,311

November 21, 2013 10:04 PM
