DirectCompute Blur Question


I am looking at page 16 of *Efficient Compute Programming*, where they discuss implementing a Gaussian blur with a compute shader. They say each thread group will have 128 threads, and each of those 128 threads will read one texel value into thread local storage. In addition, 2*KernelRadius of the threads will each read one extra texel value into thread local storage. The group can then process 128 pixels.

However, how do you program this so that 2*KernelRadius of the threads read in an additional texel without using branch instructions? It seems like you have to identify which threads belong to that set, and if a thread falls in it, have it load the additional texel. Is the overhead of branching worth avoiding the redundant loads?

I've read that per-thread branching like this kills performance, and that you should only branch on values that are uniform across a group (such as the group ID), since then all threads take the same path.

The issues around branching and performance are not as clear-cut as they might first appear; it is mostly a question of trade-offs. In this case we are 'wasting' a small amount of compute power in order to reduce memory bandwidth for the whole algorithm.

The 'if' statement itself is (practically) free: the ALU units perform the compare, while the jump to the correct piece of program code is handled by the sequencer, which lives outside the ALU units. When loading data this way, some threads will be masked out while the 'if' branch is taken, which does waste some compute power; however, this is offset by the overall increase in efficiency from using local store memory later on and by the reduction in memory bandwidth requirements.
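To make the pattern concrete, here is a minimal HLSL sketch of a horizontal pass with 128 threads per group. The constants (`N`, `KernelRadius`, `gWidth`) and resource names are illustrative assumptions, not taken from the slides:

```hlsl
// Sketch only: cooperative load for a horizontal blur pass.
#define N 128
#define KernelRadius 5

Texture2D<float4>   gInput;
RWTexture2D<float4> gOutput;

cbuffer Params { int gWidth; };   // width of gInput, in texels (assumed)

// 128 'body' texels plus a KernelRadius apron on each side.
groupshared float4 gCache[N + 2 * KernelRadius];

[numthreads(N, 1, 1)]
void HorzBlurCS(int3 gtid : SV_GroupThreadID,
                int3 dtid : SV_DispatchThreadID)
{
    // Every thread loads exactly one texel...
    gCache[gtid.x + KernelRadius] = gInput[dtid.xy];

    // ...and 2*KernelRadius of them load one extra texel to fill the
    // apron. The compare is done by the ALUs; threads that fail it are
    // simply masked out for the duration of the extra load.
    if (gtid.x < KernelRadius)
    {
        int x = max(dtid.x - KernelRadius, 0);           // clamp left edge
        gCache[gtid.x] = gInput[int2(x, dtid.y)];
    }
    if (gtid.x >= N - KernelRadius)
    {
        int x = min(dtid.x + KernelRadius, gWidth - 1);  // clamp right edge
        gCache[gtid.x + 2 * KernelRadius] = gInput[int2(x, dtid.y)];
    }

    GroupMemoryBarrierWithGroupSync();

    // Blur from local store: every tap now hits groupshared memory
    // instead of issuing another texture fetch. Gaussian weights are
    // omitted here; this is a plain box average for brevity.
    float4 sum = 0;
    [unroll]
    for (int i = -KernelRadius; i <= KernelRadius; ++i)
        sum += gCache[gtid.x + KernelRadius + i];
    gOutput[dtid.xy] = sum / (2 * KernelRadius + 1);
}
```

The two 'if' blocks are exactly the branches being asked about: cheap compares, with a handful of threads doing one extra load each.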

Now, the real problem with branching is that if some threads go one way and some go the other, the hardware has to execute both paths, with the inactive threads masked out. This means that if your 'if' path is 16 cycles long and your 'else' path is 8, wavefronts (to use AMD's term) that diverge across the branch end up burning 24 cycles, which is why you need to be careful with them.
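As a tiny illustration of that cost (the function names and cycle counts here are made up):

```hlsl
// Hypothetical per-thread branch inside one wavefront.
float r;
if (value > threshold)   // lanes disagree on this condition...
    r = PathA(value);    // say ~16 cycles of ALU work
else
    r = PathB(value);    // say ~8 cycles
// ...so the wavefront executes both paths with lanes masked out:
// roughly 16 + 8 = 24 cycles, even though each lane only 'uses' one.
```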

In this case, however, because we are only executing an 'if' statement, the overhead is small and consistent across all thread groups, which, combined with the aforementioned advantages, makes the cost worthwhile.

Also, this class of problem tends to be bottlenecked by memory bandwidth at start-up anyway, so you'll probably end up stalling some threads waiting for data to arrive; the 'cost' of the extra load instructions is going to be small in the grand scheme of things.

And yes, high-frequency branching certainly hurts performance: for a single if/else statement, the runtime cost of a divergent wavefront is the total of both paths.
