
DirectCompute Blur Question

Quat    568
I am looking at page 16 of

[url="http://developer.amd.com/gpu_assets/Efficient%20Compute%20Shader%20Programming.pps"]Efficient Compute Shader Programming[/url]

where they discuss implementing a Gaussian blur with a compute shader. They say a thread group will have 128 threads, and each of the 128 will read one texel value into thread local storage. In addition, 2*KernelRadius of those threads will read a second texel value into thread local storage. The group can then process 128 pixels.
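The cooperative load they describe can be mimicked with a small Python sketch (the names and the radius of 7 are my own assumptions, not taken from the slides). It checks that when each of the 128 threads loads one texel, and the first 2*KernelRadius threads each load one extra border texel, the shared cache of 128 + 2*KernelRadius entries is filled exactly once:

```python
N = 128          # threads per group (one output pixel each)
R = 7            # assumed KernelRadius

def loads_for_thread(i, group_start):
    """Return the (cache_slot, source_texel) pairs written by thread i
    for a group whose first output pixel is group_start."""
    loads = [(i + R, group_start + i)]        # every thread: one texel
    if i < 2 * R:                             # first 2R threads: one extra
        if i < R:                             # left border of the cache
            loads.append((i, group_start - R + i))
        else:                                 # right border of the cache
            loads.append((i + N, group_start + N - R + i))
    return loads

# The group's loads cover every cache slot exactly once.
cache_slots = [slot for i in range(N)
               for slot, _ in loads_for_thread(i, 0)]
assert sorted(cache_slots) == list(range(N + 2 * R))
```

In an actual shader the extra load would sit behind a test on the group thread ID, with the source coordinate clamped to the texture bounds (clamping is omitted here).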

However, how do you program this so that 2*KernelRadius of the threads read in an additional texel value without using branch instructions? It seems like you somehow have to identify the 2*KernelRadius threads that will read the extra texel, and if a thread falls in that set, it loads the additional texel. Is the overhead of branching worth fixing the redundant-read problem?

I've read that per-thread branching like this kills performance, and that you should only branch on the group ID, since that means all threads take the same path.

_the_phantom_    11250
The issues around branching and performance are not as clear cut as they might first appear; it is mostly a question of trade-offs. In this case we are 'wasting' a small amount of compute power in order to reduce the memory bandwidth requirements of the whole algorithm.

The 'if' statement itself is (practically) free; the ALU units perform the compare, but jumping to the correct piece of program code is handled by the sequencer, which lives outside the ALU units. In the case of loading in some data, this means that some threads will be masked out while the 'if' branch is taken, which does waste some compute power; however, this is offset by the overall increase in efficiency from using local store memory later on and by the reduction in memory bandwidth requirements.

Now, the problem with branching itself is that if some threads go one way and some go the other, then you have to execute both paths, with some threads masked out in each. This means that if your 'if' path is 16 cycles long and your 'else' path is 8, then wavefronts (to use AMD's term) which diverge will end up burning 24 cycles, which is why you need to be careful with them.
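That cycle arithmetic can be sketched with a toy cost model (my own illustration of lockstep execution, not vendor documentation):

```python
def wavefront_cycles(paths_taken, if_cost, else_cost):
    """Cycle cost of one wavefront under lockstep execution:
    every path taken by at least one thread in the wavefront must
    be executed, with non-participating threads masked out."""
    cost = 0
    if 'if' in paths_taken:
        cost += if_cost
    if 'else' in paths_taken:
        cost += else_cost
    return cost

# All threads agree: only the taken path is executed.
assert wavefront_cycles({'if'}, 16, 8) == 16
# Divergent wavefront: both paths run back to back.
assert wavefront_cycles({'if', 'else'}, 16, 8) == 24
```

In the blur's gather phase there is no 'else' path at all, so a divergent wavefront pays only the cost of the extra loads themselves.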

In this case, however, because we are only executing an 'if' statement, the overhead is small and consistent across all thread groups, which, combined with the aforementioned advantages, makes the cost worthwhile.

Also, this class of problem tends to be bottlenecked by memory bandwidth at start-up anyway, so you'll probably end up stalling some threads waiting for data to arrive, and the 'cost' of the extra load instructions is going to be small in the grand scheme of things.

And yes, high-frequency branching certainly hurts performance: for a single 'if' statement, whenever a wavefront diverges, the runtime cost is the total of both paths.
