
phantom

Member Since 15 Dec 2001
Online Last Active Today, 09:15 AM

Topics I've Started

Compute Shader Fail

16 June 2013 - 02:09 PM

There was some confusion as to why our (very work in progress) tone mapping solution was running so slowly; on a simple scene which was GPU limited we were failing to hit 60fps on even a 470GTX.

After a couple of weeks of this I finally got ahead of work enough to have a look and very quickly found a few minor problems...

The code computes a histogram of the screen over a couple of stages; stages 1 and 2 were running very slowly and appeared to be horribly bandwidth limited (on a laptop with around 50% of the bandwidth of my card they ran 50% slower). Breaking out Nsight gave us some nice timing information.

The first pass, which broke the screen up into 32x32 tiles, thus requiring 60x34 work groups (or 2088960 threads across the device) for a 1920*1080 source image, looked like this;
if(thread == first thread in group)
{
    for(uint i = 0; i < 128; ++i) // (1)
        localData[i] = 0;

}

if(thread == first across ALL the work groups) // (2)
{
    for(uint i = 0; i< 128; ++i)
        uav0[i] = 0;
}

if(thread within screen bounds)
{
    result = do some work;
    interlockAdd(localData[result], 1);
}

if(thread == first in group) // (3)
{
    for(uint i = 0; i < 128; ++i)
        uav1[i] = localData[i];
}
Total execution time on a 1920*1080 screen: 3.8ms+
Threads idle at each marked point;
(1) 1023 per group
(2) 2088959 across the whole dispatch
(3) 1023 per group
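The numbers above can be re-derived with a few lines of arithmetic. This is only a sketch in Python (not shader code); the names are made up for illustration, and it assumes one thread per pixel with groups rounded up to cover the screen, which is what the 60x34 figure implies.

```python
import math

TILE = 32                      # tile edge in pixels, one thread per pixel (assumed)
WIDTH, HEIGHT = 1920, 1080     # source image size from the post

groups_x = math.ceil(WIDTH / TILE)     # 60
groups_y = math.ceil(HEIGHT / TILE)    # 34, since 33.75 rounds up
threads_per_group = TILE * TILE        # 1024
total_groups = groups_x * groups_y     # 2040
total_threads = total_groups * threads_per_group

# Threads with nothing to do while one thread runs a serial 128-entry loop.
idle_per_group_at_1_and_3 = threads_per_group - 1   # points (1) and (3)
idle_across_dispatch_at_2 = total_threads - 1       # point (2)

# Threads in the last row of groups that fall below the bottom of the screen.
off_screen_threads = total_threads - WIDTH * HEIGHT
```

The off-screen count is why the "thread within screen bounds" test is needed at all: 34 rows of groups cover 1088 pixel rows, not 1080.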

Amusingly point (2) was to clear a UAV, used in a later stage, and was there to 'save a dispatch call of a 1x1x1 kernel' o.O

Even a quick optimisation of;
(1) have each of the first 32 threads write the uints; more work gets done, bank conflicts avoided due to offsets of each thread doing the write
(2) having 32 threads write this data (although this shouldn't need to be done)
(3) having the first 32 threads per group write the data, same reasons as (1)

Resulted in the time going from 3.8ms+ down to around 3.0ms+ or a saving of approximately 0.8ms.
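One way to picture the 32-thread version of (1) and (3) is the interleaved layout below. This is a Python stand-in rather than the actual shader; the exact offsets used in the real code aren't shown in the post, so the `thread_id + step * 32` pattern and the function names are assumptions. The inner loop represents 32 threads running in lockstep.

```python
BINS = 128       # histogram buckets per tile, from the post
WRITERS = 32     # threads taking part in the clear / copy

def bins_for_thread(thread_id):
    """Bins a single thread touches: thread_id, +32, +64, +96."""
    return list(range(thread_id, BINS, WRITERS))

def cooperative_clear(local_data):
    """Clear 128 entries in 4 steps instead of 128 serial writes."""
    for step in range(BINS // WRITERS):
        # On the GPU these 32 writes happen together; they hit 32
        # consecutive uints, so no two threads share an address.
        for thread_id in range(WRITERS):
            local_data[thread_id + step * WRITERS] = 0
    return local_data
```

On hardware where local memory is split into 32 four-byte banks (an assumption about the GPU, not something stated above), 32 consecutive uints land in 32 different banks, which is where the 'bank conflicts avoided' part would come from.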

That wasn't the biggest problem... oh no!

You see, after this comes a stage where the data written per-tile is then accumulated into a UAV as the complete histogram of the scene. This means in total we needed to process 60x34 (2040) tiles of data with 128 unsigned ints in each tile.

buffer<uint> uav;
buffer<uint> srv;
if(thread < totalTileCount)
{
    for(uint i = 0; i < 128; ++i)
        interlockAdd(uav[i], srv[threadTile + i]);
}
In this case, on each loop iteration a thread reads in a single uint value from its tile and then does an interlocked add to the matching accumulation bucket.

OR, to look at it another way;
1) each thread reads a single uint, with each thread putting in a request 128 uints away from all the others around it which is a memory fetch nightmare
2) each thread then tries to atomically add that value to the same index in a destination buffer, serialising all the writes as each one tries to complete.
3) each thread reads and writes 128*32bit or 512 bytes which, across the whole dispatch, works out to be ~1MB in both directions. (512 bytes * 2040 tiles in each direction, plus interlock overhead.)

This routine was timed at ~4ms on a 470GTX and only got worse when memory bandwidth was reduced on the laptop.
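To put rough numbers on the three points above, here is a small Python sketch. It assumes `threadTile` is simply the thread index times 128 (the post only says neighbouring threads are 128 uints apart), and the helper name is hypothetical.

```python
BINS = 128
BYTES_PER_UINT = 4
TILES = 60 * 34                          # 2040 tiles, one thread per tile

tile_bytes = BINS * BYTES_PER_UINT       # 512 bytes per tile
read_bytes = tile_bytes * TILES          # about 1 MB fetched in total
atomic_adds = BINS * TILES               # interlocked adds issued in total
contenders_per_bucket = TILES            # threads fighting over each uav[i]

def read_indices(i, threads=4):
    """uint index fetched by the first few threads on loop iteration i."""
    return [t * BINS + i for t in range(threads)]
```

On any given iteration, adjacent threads fetch addresses 512 bytes apart while all of them target the same destination bucket, so the reads can't be coalesced and the adds have to queue up behind each other.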

This one I did take a proper look at and changed to a reduce style operation so;
1) Each thread group was 32 threads
2) Each thread group handled more than one tile (20 was my first guess value for the first pass, reduced on later passes)
3) source and destination buffers were changed to uint4 types
4) For the first tile each thread group reads in the whole tile at once (32xuint4 reads) and pushes it to local memory
5) Then all the rest of the tiles are added in (non-atomically, we know this is safe!)
6) Then all threads write back to the thread group's destination tile
7) Next dispatch repeats the above until we get a single buffer like before
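The steps above can be sketched on the CPU as follows. This is a Python model of the data flow only, not the shader: the function names are invented, the 32 threads with one uint4 each are flattened into a loop over 128 bins, and it keeps 20 tiles per group on every pass for simplicity where the real version lowered that on later passes.

```python
BINS = 128

def reduce_pass(tiles, tiles_per_group):
    """One dispatch: each group folds up to tiles_per_group tiles into one."""
    out = []
    for start in range(0, len(tiles), tiles_per_group):
        group = tiles[start:start + tiles_per_group]
        local = list(group[0])        # step 4: first tile goes to local memory
        for tile in group[1:]:        # step 5: plain adds, no atomics needed
            for b in range(BINS):     # each bin is owned by exactly one thread
                local[b] += tile[b]
        out.append(local)             # step 6: one destination tile per group
    return out

def reduce_all(tiles, tiles_per_group=20):
    """Step 7: repeat passes until a single tile (the histogram) is left."""
    counts = [len(tiles)]
    while len(tiles) > 1:
        tiles = reduce_pass(tiles, tiles_per_group)
        counts.append(len(tiles))
    return tiles[0], counts
```

With 2040 input tiles and 20 tiles per group this model goes 2040, 102, 6, 1 tiles, i.e. three passes. The non-atomic add is safe because a bin in a group's local tile is only ever touched by the one thread that owns it.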

Per pass each thread group still reads 512 bytes per source tile but now writes back only a single 512-byte destination tile.
Or, for the first pass;
1) ~1MB in, coalesced reads
2) 51KB out, coalesced writes
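The in/out figures fall out of the tile counts. A quick Python check, assuming the first pass uses 20 tiles per group as mentioned earlier (`pass_traffic` is just an illustrative helper):

```python
import math

TILE_BYTES = 128 * 4     # 128 uints per tile

def pass_traffic(tiles_in, tiles_per_group):
    """Bytes read and written by one reduction pass."""
    groups = math.ceil(tiles_in / tiles_per_group)
    return tiles_in * TILE_BYTES, groups * TILE_BYTES

bytes_in, bytes_out = pass_traffic(2040, 20)
kib_out = bytes_out / 1024
```

That is 2040 tiles read by 102 groups, each writing back one tile, so the writes shrink by a factor of 20 compared with the interlocked version.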

Amusingly this first try, which took 3 passes, reduced the runtime from 4ms to 0.45ms in total :)

Moral of this story;
1) Don't let your threads idle
2) Watch your data read/write accesses!

WIP : A Lazy Sunday

10 February 2013 - 08:36 AM

Hey all,

Every so often I break out FL Studio, have a play around and see if anything comes of it. More often than not I get a short couple of bars which go nowhere, so I stop and go and play a game instead :D

Today, however, I sat down, plotted out a little tune, worked around it and got something I'm reasonably happy with.

https://soundcloud.com/asylumsmile/a-lazy-sunday (So called because... well... it's been a lazy Sunday for me, heh)

Still a work in progress, but some mixing has been done to it so that I can get a sound I'm happy with and get things 'settled' together.

Any and all feedback welcomed :)

Code blocks are broken in IE10

08 February 2013 - 05:49 AM

As apparently my mail to the mod list was ignored...

Basically the 'code' blocks are utterly broken on IE10: instead of a formatted box of text, the block is reduced to a single line which scrolls off to the right.

The source blocks, as above, work fine, but given that 'code' is what the button on the editor produces this isn't very helpful.

So, I ask again: is this known? Is it going to be fixed? If so, when?


How do you work?

07 June 2012 - 04:34 AM

While I don't often fire up FL Studio, when I do I find myself hitting the same problem time after time; I'll come up with a small 4 or 8 bar riff I'm happy with and then hit a wall when I can't figure out how to progress things.

So I was wondering how you guys out there do it?

Do you fire up a program and just mess about until you hit something and then run with it?
Do you go in with a tune/riff already in mind and an idea of where you want to go?

Or is it, like most things, just a matter of practice; playing around, seeing what works, and over time getting a feel for how you can combine things, what works well and how to progress?

LCD TV Advice..

05 August 2010 - 03:40 AM

Having recently moved into a new place I find myself in need of a TV for the first time in... well, years. However, I'm somewhat lacking in knowledge in this area, so I thought where better than gd.net to get some advice? [grin]

I'm looking at something around 32" and I'd like a 1080p screen; having used both 1080p and 768 screens before, I prefer the former.

It'll be used for TV, DVD, Blu-ray and game playing so it'll certainly have some HD input thus the 1080p [grin]

Price wise... up to £500, although I could probably go a little higher than that if it was deemed worth it.

I'm currently considering a Panasonic Viera TV, as this is the same kind my parents have (if not a smaller version) and thus far I've liked the TV they have, but I'm open to other ideas...

So, any suggestions on who to get? who to avoid?
