Will a compute thread stall on a memory write?

7 comments, last by Matias Goldberg 7 years, 2 months ago

Hey Guys,

From what I know, the more warps/wavefronts scheduled on one CU (higher occupancy), the better the performance when your shader has lots of memory reads: when one warp stalls waiting on memory, the CU switches to another warp to keep itself busy, so more warps per CU means less chance the CU will idle waiting on a memory read.

But what about the case where your shader only has memory writes? Theoretically a memory write shouldn't stall your warp, since no later instruction depends on the write. But since memory writes have limited bandwidth, they will probably affect warp execution somehow... but how? And if a warp won't be stalled by a memory write, does that mean you don't have to struggle to fit more warps into a CU's budget? (No stall means you don't need to switch to another warp to keep the CU busy.)

Thanks

Writes are less sensitive to latency than reads, but that doesn't mean you don't still want some latency hiding.

Most (if not all) hardware is going to have some fixed limit on how many UAV write operations can be performed per second, and if that limit is hit then a wavefront may have to stall before it gets a chance to execute the write/store operation. If you have an occupancy of only 1 on every SIMD, then any time spent waiting to store is time not spent doing ALU operations to calculate the next value you want to write out. It would be a very unusual shader if it only wrote data and spent no time reading or calculating anything. Something like a Compute Shader used to 'memset' memory might be the closest thing to a shader that could get away with an occupancy of only 1.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

Memory Reads stall because either the data hasn't arrived yet by the time it is needed (latency), or the memory BW limit has been hit.

Memory Writes stall because either the memory BW limit has been hit, or some other HW limit of write operations per second has been hit.

In an analogy, latency is the answer to the question "how long it takes for a truck to deliver a cargo from one city to another", while bandwidth answers the question "how much luggage you can fit in the truck to make it in one trip".

The main reason shaders stall is because of latency ("the truck hasn't arrived yet").

If the reason for the stalls is instead because you're hitting the BW limit ("the trucks are full; not everything fits in one trip"), then switching to a different shader won't help much unless one shader is using a lot of BW while the other uses a lot of ALU.
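To put rough numbers on the truck analogy, here is a toy model in Python. It is only a sketch: the function names and all cycle counts are made up for illustration and do not describe any real GPU. It compares issuing requests one after another against issuing them back to back so that their latencies overlap.

```python
def serial_cycles(requests, latency, transfer):
    """Each request waits for the previous one to finish completely."""
    return requests * (latency + transfer)


def overlapped_cycles(requests, latency, transfer):
    """All requests are issued back to back, so their latencies overlap.

    The memory system still needs `transfer` cycles per request, so the
    bandwidth cost is paid in full no matter how much work is in flight.
    """
    return latency + requests * transfer


# Latency-bound case (hypothetical numbers): long trip, small cargo.
print(serial_cycles(8, latency=400, transfer=4))      # 3232
print(overlapped_cycles(8, latency=400, transfer=4))  # 432

# Bandwidth-bound case (hypothetical numbers): the cargo itself dominates.
print(serial_cycles(8, latency=400, transfer=1000))      # 11200
print(overlapped_cycles(8, latency=400, transfer=1000))  # 8400
```

In the first case overlapping removes most of the cost; in the second the `requests * transfer` term remains, which is why having more work in flight does little once the bandwidth limit is the bottleneck.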

Btw, higher occupancy does not always mean higher performance. Switching too much between shaders can trash the caches. Occupancy is one of the best ways to hide latency, but as the article explains, it's not the only one.

UAVs go through the L2 cache, and this can be either a limiting or an improving factor, depending on whether you touch too many lines or always the same ones. On top of cache misses/hits, there is also a limit on how many line requests can be enqueued, so if you miss and the L2 cache has already reached its request limit, you will have to wait for a slot to free up.

Thanks guys for all your replies. So it seems that no matter whether your warp is stalled by a memory read or a write, in order to hide latency you need to make sure your shader interleaves ALU and memory instructions. Otherwise, even if you have high occupancy, switching to another warp that is also stalled doesn't help at all, right?

So when writing shader code, should we manually organize our code so that memory reads are interleaved with independent ALU work? For example:


// Assumes tex is a Buffer<float4> (or a similar resource whose Load takes a single index).
uint idx1 = FancyCompute(data1fromCB);
float4 a = tex.Load(idx1);
float ra = compute(a);

uint idx2 = FancyCompute(data2fromCB);
float4 b = tex.Load(idx2);
float rb = compute(b);

uint idx3 = FancyCompute(data3fromCB);
float4 c = tex.Load(idx3);
float rc = compute(c);

uint idx4 = FancyCompute(data4fromCB);
float4 d = tex.Load(idx4);
float rd = compute(d);

should be better than


// Same assumption: tex is a Buffer<float4> (or similar).
uint idx1 = FancyCompute(data1fromCB);
uint idx2 = FancyCompute(data2fromCB);
uint idx3 = FancyCompute(data3fromCB);
uint idx4 = FancyCompute(data4fromCB);

float4 a = tex.Load(idx1);
float4 b = tex.Load(idx2);
float4 c = tex.Load(idx3);
float4 d = tex.Load(idx4);

float ra = compute(a);
float rb = compute(b);
float rc = compute(c);
float rd = compute(d);

or does it really not matter?

Thanks

Don't bother arranging your shader code for these things; the HLSL compiler and, later, the driver compiler will rearrange it so much that it doesn't matter what you write. Only profiling both versions will tell you which one is better.

And even then, the answer only holds in the context of your profiling: depending on what else may be running alongside, one or the other may win. In your case, keeping the four float4 loads together may put pressure on the register count, but it is possible that the driver will reorder the loads (unless there is a forced barrier, which you definitely wouldn't use in a case like this).


Well... both are better... because it depends.

The compiler may do (pseudo assembly):


mov reg0 [memory]
stall
reg1 = call compute( reg0 )

mov reg0 [memory]
stall
reg2 = call compute( reg0 )

mov reg0 [memory]
stall
reg3 = call compute( reg0 )

mov reg0 [memory]
stall
reg4 = call compute( reg0 )

In total it consumes 5 registers, and stalls 4 times. Occupancy is excellent.

Or the compiler may decide this:


mov reg0 [memory]
mov reg1 [memory]
mov reg2 [memory]
mov reg3 [memory]

stall

reg4 = call compute( reg0 )
reg5 = call compute( reg1 )
reg6 = call compute( reg2 )
reg7 = call compute( reg3 )

In total it consumes 8 registers and stalls once. But because it consumes more registers, occupancy may be worse (if the register count crosses a threshold).

Which is better? Who knows. If you multiply this by 10, you may end up with one version that uses 21 registers and stalls 20 times with an occupancy of 10, while the other uses 80 registers but stalls only once with an occupancy of 3.
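As a rough illustration of how register count turns into occupancy, here is a minimal Python sketch. The defaults are assumptions based on figures commonly quoted for GCN (a budget of 256 vector registers per thread on a SIMD, allocation in blocks of 4, and at most 10 waves per SIMD); the function name is hypothetical, and real hardware also limits occupancy through scalar registers, LDS and other resources that this ignores.

```python
import math


def waves_per_simd(vgprs_per_thread, vgpr_budget=256, granularity=4, max_waves=10):
    """Estimate occupancy from vector register usage alone.

    All default values are assumptions (GCN-like); other limits such as
    scalar registers and LDS usage are deliberately ignored.
    """
    # Registers are handed out in blocks, so round the request up.
    allocated = math.ceil(vgprs_per_thread / granularity) * granularity
    # The register file bounds how many waves fit; hardware caps the rest.
    return min(max_waves, vgpr_budget // allocated)


for regs in (5, 8, 21, 80):
    print(regs, "registers ->", waves_per_simd(regs), "waves per SIMD")
```

Under these assumptions, 21 registers still allows 10 waves while 80 registers allows only 3, which lines up with the occupancy figures in the example above.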

The compiler has heuristics for the length at which it thinks one version should start outperforming the other, and it switches strategies accordingly. GPUs are incredibly complex and a lot of factors weigh in.

That is, of course, assuming the compiler can switch between strategies. If your code adds a dependency on data (indirections, branches, loops), the compiler may have no choice.

This isn't a thought exercise. You can literally watch GPUShaderAnalyzer switch strategies on Gaussian filter shaders. Try very small kernel radii, then try very large ones.

Also, don't forget there are the in-betweens:


mov reg0 [memory]
mov reg1 [memory]

stall

reg2 = call compute( reg0 )
reg3 = call compute( reg1 )

mov reg0 [memory]
mov reg1 [memory]

stall

reg4 = call compute( reg0 )
reg5 = call compute( reg1 )

This version consumes 6 registers and stalls 2 times. It sits midway between the two extremes.
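The register and stall counts of these schedules follow a simple pattern, sketched below in Python. The counting convention is an assumption taken from the pseudo-assembly above: the load registers are reused from one batch to the next, each result keeps its own register, and there is one stall per batch. The function name is hypothetical.

```python
import math


def schedule_cost(loads, batch):
    """Return (registers, stalls) when `loads` loads are issued `batch` at a time.

    Assumed counting: `batch` load registers are reused across batches,
    every result gets its own register, and each batch waits once.
    """
    registers = batch + loads
    stalls = math.ceil(loads / batch)
    return registers, stalls


for batch in (1, 2, 4):
    print("batch of", batch, "->", schedule_cost(4, batch))
```

For four loads this gives 5 registers and 4 stalls, 6 registers and 2 stalls, and 8 registers and 1 stall, matching the three versions above. It says nothing about which one is faster; that still depends on how the extra registers affect occupancy.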


This is why it's hard. We don't really know what's better until we've tried, profiled, picked the winner, and/or analyzed the patterns (that doesn't mean we don't have an intuition based on past experiences...).

For example, in the extremely optimized SeparableFilter11 demo from GPUOpen, written by AMD, the main shader has an occupancy of just 4 (for 17 taps). However, because of how it was written, most (if not all) of its waits will not actually stall, because there's a lot of ALU work between the memory loads; and if it does stall, an occupancy of 4 is enough to hide it.
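A back-of-the-envelope way to see why a low occupancy can be enough is the toy model below, written in Python. It assumes every wave alternates between a fixed amount of ALU work and one memory wait, and that the SIMD stays busy as long as the other waves have enough ALU work to cover that wait. The function name and the cycle counts are invented for illustration and are not measurements of the demo.

```python
import math


def waves_needed(latency, alu_cycles):
    """Waves required so the SIMD never idles, in a deliberately simple model.

    While one wave waits `latency` cycles, each of the other waves can
    contribute `alu_cycles` of work, so (waves - 1) * alu_cycles must be
    at least `latency`.
    """
    return 1 + math.ceil(latency / alu_cycles)


# Hypothetical numbers: plenty of ALU work per memory access.
print(waves_needed(latency=300, alu_cycles=100))  # 4

# Hypothetical numbers: very little ALU work per memory access.
print(waves_needed(latency=300, alu_cycles=20))   # 16
```

With a lot of ALU work per load, four waves cover the wait in this model; with little ALU work the requirement rises above what the hardware can keep resident, and the shader stalls regardless of occupancy.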

Also because the occupancy is low, the cache doesn't get trashed.

That demo is an incredibly good test case for examining how to write an optimized shader. I suggest you study it and experiment with it.

If you have trouble following what is going on, I suggest you read Efficient Compute Shader Programming where the algorithm is explained.

Edit: Just in case there's a misunderstanding, stalls are not implicit. There is actually an instruction the compiler inserts (in GCN these instructions are S_WAITCNT LGKM_CNT & VM_CNT).

Thanks Matias for such a detailed reply and the links! But I'm a little confused: when will the GPU actually switch warps? Only when one is stalled? Or does each warp only get a limited execution time slot, like in the CPU thread model?

Big thanks

Usually when S_WAITCNT instructions are reached.

You can look at Pyramid's GCN Simulator that simulates the execution of GCN ISA to gather performance stats, including wave switching. It may not be perfect but the developer did a fair amount of research to get the simulator to be as accurate as reasonably possible.
