Sampling textures is something that GPUs are very specifically designed to be good at. It's cheaper than computation in a lot of cases.
Actually it is the contrary. Yes, sampling textures is very fast; however, computations are even faster. I can't find a good reference, but the ALU:Tex ratio keeps getting higher, i.e. shaders should do more calculations for each texture fetch. Nowadays even lookup textures can be replaced with the actual calculations, at least in simple cases.
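To make that concrete, here's a small Python sketch of the classic case: a baked 1D lookup table (say, a gamma curve) versus just doing the math. The LUT size and curve are illustrative assumptions, not from any particular engine:

```python
# A classic case: a 1D 'gamma curve' lookup texture replaced by the
# direct calculation. LUT size and exponent are illustrative.
LUT_SIZE = 256
lut = [(i / (LUT_SIZE - 1)) ** 2.2 for i in range(LUT_SIZE)]

def via_lut(x):
    # Nearest-texel fetch from the baked table (no filtering).
    return lut[round(x * (LUT_SIZE - 1))]

def via_alu(x):
    # The 'actual calculation' the table was baked from.
    return x ** 2.2

# The ALU version is exact; the LUT only ever approximates it, and on
# a modern GPU the pow() is likely cheaper than the memory traffic.
err = max(abs(via_lut(i / 1000) - via_alu(i / 1000)) for i in range(1001))
print(err < 0.01)  # True: the table was only an approximation anyway
```

On old hardware the fetch won; on current hardware the `pow` usually does, and you get the exact curve rather than a quantised one.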
Well, to be correct about this, sampling textures may be fast depending on what is going on.
If you are lucky and your texture data sits in cache then yay! Fast data return and all is good.
If you are unlucky and the request has to go out to VRAM then noooo! Slow data return, because while GDDR has high bandwidth its latency is pretty damned poor - plus you are now queued behind other requests.
The good news is that if the GPU has enough work to do then you might never notice these stalls: your thread group will get swapped out and another will get a run at the execution resources, covering the latency of your data fetch. If you don't have enough work, or other thread groups can't be swapped in, then the GPU will stall while it waits for the data to come back so it can resume a thread group.
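To get a feel for why "enough work" matters, here's a back-of-the-envelope sketch in Python. All the cycle counts are made-up illustrative numbers, not figures for any real GPU:

```python
import math

# Rough latency-hiding model. Both numbers are assumptions for
# illustration only; real values vary wildly by architecture.
FETCH_LATENCY_CYCLES = 400   # assumed round-trip to VRAM
ALU_CYCLES_PER_GROUP = 50    # assumed ALU work each thread group can
                             # issue between dependent fetches

# While one group waits on its fetch, the scheduler runs the ALU work
# of the other resident groups. To hide the latency fully, the
# combined ALU work of the *other* groups must cover it:
groups_needed = 1 + math.ceil(FETCH_LATENCY_CYCLES / ALU_CYCLES_PER_GROUP)
print(groups_needed)  # 9 resident groups to avoid a stall
```

The point isn't the exact number; it's that fewer resident groups (e.g. because each thread hogs registers) means less latency to hide behind, and the stalls start showing.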
Note: thread group.
When thinking about branching you have to consider how it will affect all the threads running at a time in that group. The common group size on DX11 hardware (which I'll focus on, as the topic was tagged DX11) is 32 or 64 threads to a wavefront (32 on NVIDIA, which calls them 'warps'; 64 on AMD), which is all the threads working in lock-step together.
When it comes to branching the rules are simple.
if (conditionA)
{
    /* code block A */
}
else
{
    /* code block B */
}
/* code block C */
If the GPU is executing this code then ALL threads will evaluate 'conditionA'.
If all threads evaluate 'conditionA' to 'true' then they will all run 'code block A' followed by 'code block C'.
If all threads evaluate 'conditionA' to 'false' then they will all run 'code block B' followed by 'code block C'.
The fun starts when some threads evaluate to 'true' and some to 'false', at which point the GPU does the following:
- mask out all lanes which evaluated 'false'
- run 'code block A'
- mask out all lanes which evaluated 'true'
- run 'code block B'
- mask all lanes back in
- run 'code block C'
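Those steps can be simulated in plain Python. This is a sketch of the execution model only; the 8-lane wavefront and the lane conditions are hypothetical, with `cond` playing the role of 'conditionA':

```python
# Simulate one wavefront of 8 lanes executing the if/else above.
# Each lane has its own result for 'conditionA' (hypothetical values).
cond = [True, True, False, True, False, False, True, False]

trace = [[] for _ in cond]  # what each lane actually executed

# Step 1+2: mask out the 'false' lanes, run code block A on the rest.
for lane, c in enumerate(cond):
    if c:
        trace[lane].append("A")

# Step 3+4: flip the mask, run code block B on the other lanes.
for lane, c in enumerate(cond):
    if not c:
        trace[lane].append("B")

# Step 5+6: mask everyone back in, run code block C.
for lane in range(len(cond)):
    trace[lane].append("C")

# The wavefront as a whole spent time in A *and* B, even though each
# individual lane only kept the results of one of them.
print(trace[0])  # ['A', 'C']
print(trace[2])  # ['B', 'C']
```

Note that wall-clock time for the whole wavefront is the time for A plus B plus C whenever the lanes disagree; the masking only controls which lanes keep the results.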
As you can see, the GPU has to execute both sides of the 'if', which, depending on the code in blocks A and B, can cause a major performance hit as it has to do all the work in both branches (texture fetches in the masked-out path might not actually happen, depending on the architecture and other issues, so you can potentially still save some bandwidth/texture cache).
That said, if you know your data and you know how things are going to branch on a group basis, then branching can help in certain cases (even on DX9 hardware, where I've had wins doing it).
The best example of this was on a game I worked on where our roads were drawn as a mesh overlaid onto the terrain and blended in. The road texture consisted of a large section in the middle where alpha=1, a few transition texels where alpha tended towards 0, and a few texels where alpha=0 along the sides.
When applied to the terrain, large portions of the road mesh were covered by the alpha=0 part and others by the alpha=1 part; by placing an 'if' statement on the initial alpha value (sampled from a texture), large amounts of work could be skipped (the road had other textures applied to it) and the pixels discarded (saving blending).
This worked well because, inside a pixel quad, the majority of groups had every thread at alpha=0 or every thread at alpha=1, with only the small border section varying; and as the 'else' case was a simple 'discard' statement, the cost of running both code paths in that instance was small, resulting in a large performance increase on the target hardware.
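A rough Python sketch of why that layout pays off. The quad counts and per-path costs below are invented for illustration; the real numbers depend on road width, texture content and hardware:

```python
# Hypothetical distribution of 2x2 pixel quads over the road mesh.
quads_all_alpha0 = 4000   # whole quad takes the discard path
quads_all_alpha1 = 5000   # whole quad runs the full road shader
quads_mixed      = 200    # border quads: must run both paths

COST_FULL    = 20.0  # assumed cost of the full texturing path
COST_DISCARD = 1.0   # assumed cost of sample + compare + discard

# Without the branch, every quad pays for the full path.
without_branch = (quads_all_alpha0 + quads_all_alpha1 + quads_mixed) * COST_FULL

# With the branch, only coherent alpha=0 quads get the cheap path;
# mixed and alpha=1 quads pay the branch test on top of the full path.
with_branch = (quads_all_alpha0 * COST_DISCARD
               + quads_all_alpha1 * (COST_FULL + COST_DISCARD)
               + quads_mixed * (COST_FULL + COST_DISCARD))

print(with_branch / without_branch)  # ~0.62 of the original cost
```

The win comes entirely from the coherent alpha=0 quads; if the texture forced most quads into the mixed case, the branch would cost more than it saved.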
The key point to get across here is that branching CAN help but you have to be aware of the data you are feeding it and the amount of work to do.
In this specific case I probably wouldn't branch, as you don't really have enough work to do anyway; if the texturing statement was a bit more complex AND colour == (0,0,0,0) occurred with a frequency such that there was a high chance all the threads in a group would take one path or the other, then I'd be tempted to use it.
Branching has its uses; we are beyond the days of it totally destroying your performance, but you still have to use it sensibly.
And yes, burning ALU time is often better than burning bandwidth because, as mentioned, the ALU:Tex ratio has long been shifting in that direction; modern GPUs are ALU monsters, frankly.