uint idx1 = FancyCompute(data1fromCB);
float4 a = tex.Load(idx1);
float ra = compute(a);
uint idx2 = FancyCompute(data2fromCB);
float4 b = tex.Load(idx2);
float rb = compute(b);
uint idx3 = FancyCompute(data3fromCB);
float4 c = tex.Load(idx3);
float rc = compute(c);
uint idx4 = FancyCompute(data4fromCB);
float4 d = tex.Load(idx4);
float rd = compute(d);
should be better than
uint idx1 = FancyCompute(data1fromCB);
uint idx2 = FancyCompute(data2fromCB);
uint idx3 = FancyCompute(data3fromCB);
uint idx4 = FancyCompute(data4fromCB);
float4 a = tex.Load(idx1);
float4 b = tex.Load(idx2);
float4 c = tex.Load(idx3);
float4 d = tex.Load(idx4);
float ra = compute(a);
float rb = compute(b);
float rc = compute(c);
float rd = compute(d);
or it really doesn't matter....
Well... both are better... because it depends.
The compiler may do (pseudo assembly):
mov reg0 [memory]
stall
reg1 = call compute( reg0 )
mov reg0 [memory]
stall
reg2 = call compute( reg0 )
mov reg0 [memory]
stall
reg3 = call compute( reg0 )
mov reg0 [memory]
stall
reg4 = call compute( reg0 )
In total it consumes 5 registers, and stalls 4 times. Occupancy is excellent.
Or the compiler may decide this:
mov reg0 [memory]
mov reg1 [memory]
mov reg2 [memory]
mov reg3 [memory]
stall
reg4 = call compute( reg0 )
reg5 = call compute( reg1 )
reg6 = call compute( reg2 )
reg7 = call compute( reg3 )
In total it consumes 8 registers, and stalls 1 time. But because it consumes more registers, occupancy may be worse (if the register count hits the threshold).
What is better? Who knows. If you multiply this by 10, you may find that one version uses 21 registers and stalls 20 times with an occupancy of 10, while the other uses 80 registers but stalls only once with an occupancy of 3.
The compiler has heuristics that estimate at which length one version should start outperforming the other, and it switches strategies accordingly. GPUs are incredibly complex and a lot of factors weigh in.
That is, of course, assuming the compiler can switch between strategies. If your code adds a dependency on the data (indirections, branches, loops), the compiler may have no choice.
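To illustrate with a hypothetical variant of the code above (the chaining through `a.x` and `b.x` is made up for the example): if each index is computed from the previous load's result, the loads cannot be batched together, and the serialized, stall-per-load schedule is the only option.

```hlsl
// Each load's address depends on the previous load's result,
// so the compiler cannot hoist the loads above one another.
uint idx1 = FancyCompute(data1fromCB);
float4 a = tex.Load(idx1);
uint idx2 = FancyCompute(asuint(a.x)); // depends on a
float4 b = tex.Load(idx2);             // must wait for a
uint idx3 = FancyCompute(asuint(b.x)); // depends on b
float4 c = tex.Load(idx3);             // must wait for b
```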
This isn't a thought exercise. You can literally watch GPUShaderAnalyzer switch strategies on Gaussian filter shaders. Try very low kernel radii, then try very large ones.
Also don't forget there's the in-betweens:
mov reg0 [memory]
mov reg1 [memory]
reg2 = call compute( reg0 )
reg3 = call compute( reg1 )
mov reg0 [memory]
mov reg1 [memory]
stall
reg4 = call compute( reg0 )
reg5 = call compute( reg1 )
This version consumes 6 registers and stalls 2 times. This is midway between both worlds.
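In HLSL source, that in-between schedule roughly corresponds to issuing the loads in pairs (a sketch only; the compiler is free to rearrange it however its heuristics decide):

```hlsl
// Issue two loads, consume them, then issue the next two.
uint idx1 = FancyCompute(data1fromCB);
uint idx2 = FancyCompute(data2fromCB);
float4 a = tex.Load(idx1);
float4 b = tex.Load(idx2);
float ra = compute(a);
float rb = compute(b);
uint idx3 = FancyCompute(data3fromCB);
uint idx4 = FancyCompute(data4fromCB);
float4 c = tex.Load(idx3);
float4 d = tex.Load(idx4);
float rc = compute(c);
float rd = compute(d);
```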
This is why it's hard. We don't really know what's better until we've tried, profiled, picked the winner, and/or analyzed the patterns (that doesn't mean we don't have an intuition based on past experience...).
For example, in AMD's extremely optimized SeparableFilter11 demo from GPUOpen, the main shader has an occupancy of just 4 (for 17 taps); however, because of how it was written, most (if not all) of its stalls will not actually stall at all, because there is a lot of ALU work between the memory loads; and if it does stall, an occupancy of 4 is enough to hide it.
Also, because the occupancy is low, the cache doesn't get thrashed.
That demo is an incredibly good test case for examining how to write an optimized shader. I suggest you study it and experiment with it.
If you have trouble following what is going on, I suggest you read Efficient Compute Shader Programming, where the algorithm is explained.
Edit: Just in case there's a misunderstanding, stalls are not implicit. There is actually an instruction the compiler inserts (on GCN this is S_WAITCNT, which waits on the LGKM_CNT and VM_CNT counters).
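For instance, the batched version above would come out roughly like this on GCN (simplified pseudo ISA, not real compiler output):

```
buffer_load_dwordx4 v[0:3], ...   // four loads in flight at once
buffer_load_dwordx4 v[4:7], ...
buffer_load_dwordx4 v[8:11], ...
buffer_load_dwordx4 v[12:15], ...
s_waitcnt vmcnt(0)                // the explicit stall: wait until all
                                  // outstanding memory loads complete
// ...ALU work on v[0:15] follows...
```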