Compute Shader Fail

There was some confusion as to why our (very work in progress) tone mapping solution was running so slowly; on a simple, GPU-limited scene we were failing to hit 60fps even on a 470GTX.

After a couple of weeks of this I finally got far enough ahead of work to have a look, and very quickly found a few minor problems...

The code computes a histogram of the screen over a couple of stages; stages 1 and 2 were running very slowly and appeared to be horribly bandwidth limited (on a laptop with around 50% of the bandwidth of my card it ran 50% slower). Breaking out Nsight gave us some nice timing information.

The first pass, which broke the screen up into 32x32 tiles and thus required 60x34 work groups (2,088,960 threads across the device) for a 1920*1080 source image, looked like this:

if(thread == first thread in group)
{
    for(uint i = 0; i < 128; ++i) // (1)
        localData[i] = 0;
}

if(thread == first across ALL the work groups) // (2)
{
    for(uint i = 0; i < 128; ++i)
        uav0[i] = 0;
}

if(thread within screen bounds)
{
    result = do some work;
    interlockAdd(localData[result], 1);
}

if(thread == first in group) // (3)
{
    for(uint i = 0; i < 128; ++i)
        uav1[i] = localData[i];
}
Total execution time on a 1920*1080 screen: 3.8ms+
Threads idle at various points:
(1) 1023
(2) 2088959
(3) 1023

Amusingly point (2) was there to clear a UAV used in a later stage, and existed to 'save a dispatch call of a 1x1x1 kernel' o.O
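For what it's worth, the dispatch being 'saved' is about as cheap as a compute dispatch gets. A minimal sketch of what such a dedicated clear kernel might look like, assuming a 128-entry RWBuffer bound as uav0 and dispatched once as a single 1x1x1 group (the kernel name and register binding are mine, not the original code):

RWBuffer<uint> uav0 : register(u0);

[numthreads(128, 1, 1)]
void ClearHistogram(uint groupIndex : SV_GroupIndex)
{
    // One thread per bucket; a single Dispatch(1, 1, 1) clears the whole 128-entry buffer.
    uav0[groupIndex] = 0;
}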

Even a quick optimisation of:
(1) having the first 32 threads write the uints; more work gets done, and bank conflicts are avoided because each thread writes at a different offset
(2) having 32 threads write this data (although this shouldn't need to be done here at all)
(3) having the first 32 threads per group write the data out, for the same reasons as (1)

Resulted in the time going from 3.8ms+ down to around 3.0ms+, a saving of approximately 0.8ms.
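To make that concrete, here is a minimal sketch of what the first pass looks like with optimisations (1) and (3) applied, plus the group syncs the phases need between them. This assumes a 1920x1080 target, 60 groups across, a 32x32 thread group and that the uav0 clear has moved into its own dispatch as above; the texture, register bindings and the ComputeBucket() helper are placeholders of mine standing in for the real per-pixel work:

groupshared uint localData[128];

Texture2D<float4> sourceImage : register(t0);
RWBuffer<uint>    uav1        : register(u1);   // per-tile histograms, 128 uints per tile

// Placeholder for the real bucket selection: maps luminance onto 128 buckets.
uint ComputeBucket(float4 colour)
{
    float luma = dot(colour.rgb, float3(0.2126f, 0.7152f, 0.0722f));
    return min(uint(saturate(luma) * 128.0f), 127u);
}

[numthreads(32, 32, 1)]
void TileHistogram(uint3 groupId    : SV_GroupID,
                   uint3 threadId   : SV_DispatchThreadID,
                   uint  groupIndex : SV_GroupIndex)
{
    // (1) the first 32 threads each clear four buckets, strided by 32 so that
    //     consecutive threads hit consecutive shared-memory banks
    if(groupIndex < 32)
    {
        for(uint i = groupIndex; i < 128; i += 32)
            localData[i] = 0;
    }
    GroupMemoryBarrierWithGroupSync();

    // per-pixel work, as before
    if(threadId.x < 1920 && threadId.y < 1080)
    {
        uint result = ComputeBucket(sourceImage[threadId.xy]);
        InterlockedAdd(localData[result], 1);
    }
    GroupMemoryBarrierWithGroupSync();

    // (3) the first 32 threads write the tile's histogram out, same striding
    if(groupIndex < 32)
    {
        uint tileBase = (groupId.y * 60 + groupId.x) * 128;  // 60 groups across at 1920 wide
        for(uint i = groupIndex; i < 128; i += 32)
            uav1[tileBase + i] = localData[i];
    }
}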

That wasn't the biggest problem... oh no!

You see, after this comes a stage where the data written per-tile is accumulated into a UAV as the complete histogram of the scene. This means in total we needed to process 60x34 (2040) tiles of data, with 128 unsigned ints in each tile.


buffer<uint> uav;    // final 128-bucket histogram
buffer<uint> srv;    // per-tile histograms, 128 uints per tile

if(thread < totalTileCount)
{
    // threadTile = this thread's tile offset into the source buffer
    for(uint i = 0; i < 128; ++i)
        interlockAdd(uav[i], srv[threadTile + i]);
}
In this case each thread reads in a single uint value from its tile and then does an interlock add to the accumulation bucket.

OR, to look at it another way:
1) each thread reads a single uint, with each request landing 128 uints away from those of the threads around it, which is a memory fetch nightmare
2) each thread then tries to atomically add that value to the same index in the destination buffer, serialising all the writes as each one tries to complete.
3) each thread reads and writes 128*32 bits, or 512 bytes, which across the dispatch works out to be ~1MB in each direction (512 bytes * 2040 tiles each way, plus the interlock overhead).

This routine was timed at ~4ms on a 470GTX and only got worse when memory bandwidth was reduced on the laptop.

This one I did take a proper look at and changed to a reduce-style operation, so:
1) Each thread group was 32 threads
2) Each thread group handled more than one tile (20 was my first-guess value for the first pass, reduced on later passes)
3) Source and destination buffers were changed to uint4 types
4) For the first tile each thread group reads in the whole tile at once (32xuint4 reads) and pushes it to local memory
5) Then all the rest of the tiles are added in (non-atomically, we know this is safe!)
6) Then all threads write back to the thread group's destination tile
7) Next dispatch repeats the above until we get a single buffer like before

Per pass each thread group still reads 512 bytes per source tile, but now writes back only 512 bytes in total (one destination tile per group).
Or:
1) ~1MB in, coalesced reads
2) ~51KB out, coalesced writes

Amusingly this first try, which took 3 passes, reduced the runtime from 4ms to 0.45ms in total.
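For reference, here is a minimal sketch of what one of those reduce passes might look like, under the assumptions above: 32 threads per group, 20 source tiles per group on the first pass, and uint4-typed buffers. The identifiers and constants are mine rather than the original code, and a real version would also have to cope with passes whose tile count doesn't divide evenly by the tiles-per-group value:

#define TILES_PER_GROUP 20      // first-pass value; smaller on later passes
#define UINT4_PER_TILE  32      // 128 uints = 32 uint4s per tile

Buffer<uint4>   srcTiles : register(t0);   // per-tile histograms from the previous pass
RWBuffer<uint4> dstTiles : register(u0);   // one reduced tile per thread group

groupshared uint4 accum[UINT4_PER_TILE];

[numthreads(32, 1, 1)]
void ReduceTiles(uint3 groupId : SV_GroupID, uint groupIndex : SV_GroupIndex)
{
    uint firstTile = groupId.x * TILES_PER_GROUP;

    // (4) each thread reads one uint4 of the group's first tile into local memory
    accum[groupIndex] = srcTiles[firstTile * UINT4_PER_TILE + groupIndex];

    // (5) fold the remaining tiles in non-atomically; no other group touches these
    //     tiles and each thread only ever updates its own uint4 slot
    for(uint tile = 1; tile < TILES_PER_GROUP; ++tile)
        accum[groupIndex] += srcTiles[(firstTile + tile) * UINT4_PER_TILE + groupIndex];

    // (6) coalesced write of the reduced tile to this group's destination slot
    dstTiles[groupId.x * UINT4_PER_TILE + groupIndex] = accum[groupIndex];
}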

Moral of this story:
1) Don't let your threads idle
2) Watch your data read/write accesses!

Interesting. I can only imagine myself doing that, as I wouldn't have used any external tools at all - you know, until that extremely rainy day where I couldn't for the life of me figure out what's wrong... Then, in a final act of desperation, I'd open up an external tool only to have the whole picture revealed, and you know, "from that day on..."

Not sure if I could use compute shaders for anything meaningful, since my projects are very GPU-heavy graphics-wise... or is that a mistake?

Especially if you consider that there's gotta be "some" waiting in the rendering thread... How would one know?

Looks like butts.

