# Suspicious slow compute shader... Mem write bottleneck?

## Recommended Posts

Posted (edited)

Hey Guys,

I've come here to borrow you guys' insightful eyes again to help me find what I did wrong...

I have a compute shader which takes 5 input Texture2Ds (all 512x424), does a bunch of computation, and outputs 28 floats for each input pixel position. Here is the shader:

#include "FastICP.inl"
#include "CalibData.inl"
Texture2D<uint> tex_srvKinectDepth : register(t0);    //R16_UINT
Texture2D<uint> tex_srvTSDFDepth : register(t1);      //R16_UINT
Texture2D<float4> tex_srvKinectNormal : register(t2); //R10G10B10A2_UNORM
Texture2D<float4> tex_srvTSDFNormal : register(t3);   //R10G10B10A2_UNORM
Texture2D<float> tex_srvWeight : register(t4);        //R8_UNORM

RWStructuredBuffer<float4> buf_uavData0 : register(u0);//CxCx,CxCy,CxCz,Ctr
RWStructuredBuffer<float4> buf_uavData1 : register(u1);//CxNx,CxNy,CxNz,CyCy
RWStructuredBuffer<float4> buf_uavData2 : register(u2);//CyNx,CyNy,CyNz,CyCz
RWStructuredBuffer<float4> buf_uavData3 : register(u3);//CzNx,CzNy,CzNz,CzCz
RWStructuredBuffer<float4> buf_uavData4 : register(u4);//NxNx,NxNy,NxNz,CxPQN
RWStructuredBuffer<float4> buf_uavData5 : register(u5);//NyNy,NyNz,NzNz,CyPQN
RWStructuredBuffer<float4> buf_uavData6 : register(u6);//NxPQN,NyPQN,NzPQN,CzPQN

void AllZero(uint uIdx)
{
    buf_uavData0[uIdx] = 0.f;
    buf_uavData1[uIdx] = 0.f;
    buf_uavData2[uIdx] = 0.f;
    buf_uavData3[uIdx] = 0.f;
    buf_uavData4[uIdx] = 0.f;
    buf_uavData5[uIdx] = 0.f;
    buf_uavData6[uIdx] = 0.f;
}

float3 ReprojectPt(uint2 u2xy, float fDepth)
{
    return float3(float2(u2xy - DEPTH_C) * fDepth / DEPTH_F, fDepth);
}

float GetNormalMatchedDepth(Texture2D<uint> tex_srvDepth, uint3 DTid)
{
    // NOTE: the accumulation below was lost when posting; reconstructed as
    // the 2x2 neighborhood fetch the replies refer to ("two times four loads")
    uint uAccDepth = tex_srvDepth.Load(DTid);
    uAccDepth += tex_srvDepth.Load(DTid, int2(1, 0));
    uAccDepth += tex_srvDepth.Load(DTid, int2(0, 1));
    uAccDepth += tex_srvDepth.Load(DTid, int2(1, 1));
    // Average the 4 samples, convert mm to meters, negate Z
    return uAccDepth * -0.001f / 4.f;
}

// NOTE: the [numthreads] attribute, entry-point signature, and the condition
// of the first early-out were also lost when posting; reconstructed here as a
// plausible sketch (THREAD is the group-size define mentioned later in the thread)
[numthreads(THREAD, THREAD, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
    uint uIdx = DTid.x + DTid.y * u2AlignedReso.x;
    // No valid depth data (assumed condition)
    if (tex_srvKinectDepth.Load(DTid) == 0)
    {
        AllZero(uIdx);
        return;
    }
    float4 f4KinectNormal = tex_srvKinectNormal.Load(DTid) * 2.f - 1.f;
    // No valid normal data
    if (f4KinectNormal.w < 0.05f) {
        AllZero(uIdx);
        return;
    }
    float4 f4TSDFNormal = tex_srvTSDFNormal.Load(DTid) * 2.f - 1.f;
    // No valid normal data
    if (f4TSDFNormal.w < 0.05f) {
        AllZero(uIdx);
        return;
    }
    // Normals are too different
    if (dot(f4TSDFNormal.xyz, f4KinectNormal.xyz) < fNormalDiffThreshold) {
        AllZero(uIdx);
        return;
    }
    float fDepth = GetNormalMatchedDepth(tex_srvKinectDepth, DTid);
    // p is Kinect point, q is TSDF point, n is TSDF normal
    // c = p x n
    float3 p = ReprojectPt(DTid.xy, fDepth);
    float3 n = f4TSDFNormal.xyz;
    float3 c = cross(p, n);

    float3 cc = c.xxx * c.xyz; // Get CxCx, CxCy, CxCz
    buf_uavData0[uIdx] = float4(cc, 1.f); // last element is counter

    cc = c.yyz * c.yzz; // Get CyCy, CyCz, CzCz
    float3 cn = c.x * n; // Get CxNx, CxNy, CxNz
    buf_uavData1[uIdx] = float4(cn, cc.x);

    cn = c.y * n; // Get CyNx, CyNy, CyNz
    buf_uavData2[uIdx] = float4(cn, cc.y);

    cn = c.z * n; // Get CzNx, CzNy, CzNz
    buf_uavData3[uIdx] = float4(cn, cc.z);

    fDepth = GetNormalMatchedDepth(tex_srvTSDFDepth, DTid);
    float3 q = ReprojectPt(DTid.xy, fDepth);
    float pqn = dot(p - q, n);
    float3 cpqn = c * pqn; // Get cx(p-q)n, cy(p-q)n, cz(p-q)n

    float3 nn = n.xxx * n.xyz; // Get NxNx, NxNy, NxNz
    buf_uavData4[uIdx] = float4(nn, cpqn.x);

    nn = n.yyz * n.yzz; // Get NyNy, NyNz, NzNz
    buf_uavData5[uIdx] = float4(nn, cpqn.y);

    float3 npqn = n * pqn; // Get nx(p-q)n, ny(p-q)n, nz(p-q)n
    buf_uavData6[uIdx] = float4(npqn, cpqn.z);
}


Though I know this is mem intensive and will be a little slow, with a 512x424 input resolution, taking 10ms on a GTX 680m doesn't seem right. Nvidia Nsight doesn't support the 680m, so I can't get detailed perf data about where the bottleneck is (I know it must be the mem writes, but I don't think they should cost 10ms of GPU time... or am I wrong?)
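Back-of-envelope, so you can check my reasoning: the outputs are 512 x 424 x 28 floats x 4 bytes ≈ 24 MB per dispatch, and even doubling that to account for the input reads, a GPU with ~100 GB/s of bandwidth should move it in well under a millisecond. So 10ms is more than an order of magnitude over a pure-bandwidth estimate.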

I can see that changing all the output UAV raw buffers to 64-bit typed buffers should help, but that means I lose precision... So I think it's better to discuss with you guys first before I try the typed version.

Also I was wondering whether this pass would be better as a pixel shader, since I don't use LDS at all, and a PS can use compressed writes to RTs (correct me if I am wrong about that...), which may help with the mem writes. But my output data size may exceed the number-of-RTs limit, so I'd end up with multiple passes...

So please let me know if you see me doing something silly in the code, or you have any suggestions.

As always, big thanks in advance

Edited by Mr_Fox

##### Share on other sites

Is there a particular reason you need to go from a 2D data structure to a 1D structure?

I never like doing that from the point of view of getting a good cache access pattern on reads/writes.

I would start by removing one buffer store (buf_uavData6) in such a way that the compiler can't remove any other computation, i.e.:

nn = n.yyz * n.yzz; // Get NyNy, NyNz, NzNz
float3 npqn = n * pqn; // Get nx(p-q)n, ny(p-q)n, nz(p-q)n
buf_uavData5[uIdx] = float4(nn, cpqn.y) + float4(npqn, cpqn.z);
//buf_uavData6[uIdx] = float4(npqn, cpqn.z);


And see if removing just one of the stores has any effect on overall time.

The amount of data to read/write is ~26MB by my count, which doesn't warrant 10ms on such a GPU.

The things I would try are:

1) Write to a 2D data structure to see how it affects GPU time.

2) Remove one, two, then three buffer stores, but accumulate the results into other buffers, to see how much faster it gets if you write less data (and write fewer times).

3) Use a typed buffer to see if it really is the write bandwidth that's a problem.

##### Share on other sites
Posted (edited)
Is there a particular reason you need to go from a 2D data structure to a 1D structure?

Thanks Adam. Each buf_uavData will be summed down to a single value by a GPU reduction in a later pass (rough sketch below), so to save a little boundary checking and index computation, I think it is better to convert to 1D in this pass.
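For context, here is roughly what that reduction pass looks like. This is a minimal sketch only: buf_srvIn, buf_uavOut, ReduceMain, and the 128 group size are illustrative names/choices, not my actual code.

StructuredBuffer<float4> buf_srvIn : register(t0);    // one of the buf_uavData buffers
RWStructuredBuffer<float4> buf_uavOut : register(u0); // one partial sum per group

groupshared float4 lds[128];

[numthreads(128, 1, 1)]
void ReduceMain(uint3 DTid : SV_DispatchThreadID,
                uint GI : SV_GroupIndex, uint3 Gid : SV_GroupID)
{
    // 1D layout: one flat read, no 2D bounds check or index math needed
    // (512*424 is an exact multiple of 128, so no tail guard either)
    lds[GI] = buf_srvIn[DTid.x];
    GroupMemoryBarrierWithGroupSync();

    // Tree reduction within the group
    [unroll]
    for (uint s = 64; s > 0; s >>= 1)
    {
        if (GI < s)
            lds[GI] += lds[GI + s];
        GroupMemoryBarrierWithGroupSync();
    }

    // Later dispatches (or a final CPU add) sum the per-group results
    if (GI == 0)
        buf_uavOut[Gid.x] = lds[0];
}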

I would start by removing one buffer store (buf_uavData6) in such a way that the compiler can't remove any other computation [...] and see if removing just one of the stores has any effect on overall time.

I've tried reducing the 7 buffer writes to only 1, and you are right, it still takes 9ms. So it seems the 5 buffer loads are causing the slowness. I will try removing some buffer loads to see what happens, but in the meantime, any suggestions?

Thanks so much

Edited by Mr_Fox

##### Share on other sites

The textures are tiny, so I don't imagine it's that. Are you sure you haven't done something silly like Dispatch(512, 424, 1) rather than Dispatch(512/8, 424/8, 1)?

##### Share on other sites

The textures are tiny, so I don't imagine it's that. Are you sure you haven't done something silly like Dispatch(512, 424, 1) rather than Dispatch(512/8, 424/8, 1)?

Sorry, I didn't notice AllZero still writes to all 7 bufs..... after changing that, the runtime is almost linearly related to how many bufs I write to... So I guess besides changing all the raw bufs to 64-bit typed, there is nothing else I can do to make it faster?

Thanks

##### Share on other sites
Posted (edited)

What if you try rolling all the conditionals into one clause and have only one call to AllZero? I'm just wondering if your data/execution is sufficiently divergent that you're hitting multiple AllZero paths in a single wave.
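Something like this, say (a sketch reusing the names from your shader, with both normal fetches hoisted above a single test):

// One fetch block up front, one combined early-out (same thresholds as before)
float4 f4KinectNormal = tex_srvKinectNormal.Load(DTid) * 2.f - 1.f;
float4 f4TSDFNormal = tex_srvTSDFNormal.Load(DTid) * 2.f - 1.f;
bool bReject = f4KinectNormal.w < 0.05f ||
               f4TSDFNormal.w < 0.05f ||
               dot(f4TSDFNormal.xyz, f4KinectNormal.xyz) < fNormalDiffThreshold;
if (bReject)
{
    AllZero(uIdx);
    return;
}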

##### Share on other sites

What if you try rolling all the conditionals into one clause and have only one call to AllZero? I'm just wondering if your data/execution is sufficiently divergent that you're hitting multiple AllZero paths in a single wave.

Well, removing all those condition checks didn't help at all. But it brought up a question I really want to ask:

I remember getting the following tip from somewhere:

"when a GPU warp starts waiting on mem, the GPU switches to another warp to prevent stalling... and to ensure there are always instructions to execute, it is better not to have mem instructions crowded together..."

so that's why my code has memory accesses scattered all over the function body. But is that tip really helpful?

Another reason I keep the memory accesses scattered is to avoid long-lived variables (so I get more register reuse and consume fewer register slots). Should I keep applying these kinds of 'tips', or is the shader compiler already very good at this and I shouldn't bother?

Thanks

##### Share on other sites
Posted (edited)
void AllZero(uint uIdx)
{
    buf_uavData0[uIdx] = 0.f;
    buf_uavData1[uIdx] = 0.f;
    buf_uavData2[uIdx] = 0.f;
    buf_uavData3[uIdx] = 0.f;
    buf_uavData4[uIdx] = 0.f;
    buf_uavData5[uIdx] = 0.f;
    buf_uavData6[uIdx] = 0.f;
}

On GCN this can get extremely slow if the various buf_uavDataX buffers have an exact offset of a power of 2 in memory, which is likely to happen for 512*512 images.

(It's well documented, but I forgot the proper terminology and never understood the technical reasons well enough; something like the memory accesses have to be serialized because they all hit the same channel, or something along those lines.)

Example using a single buffer to make it clear:

mem[0] = 0;
mem[256] = 0;
mem[512] = 0;
...

This can happen when, e.g., I have chosen a list size of 256 and write to multiple lists.

To fix the issue, I change my list size to 257:

mem[0] = 0;
mem[257] = 0;
mem[514] = 0;

I have hit this case only once so far: the shader did 2ms of work and needed another 2ms just to write the results. After changing the list size, the writes got hidden behind the work and no longer affected execution time.

I don't know if Nvidia has this kind of problem too, or how you can easily test for it; maybe by adding an extra column of pixels, so 257x256?

Somehow it makes little sense when you consider that multiple render targets are used all the time and probably have power-of-2 sizes most of the time, but maybe it helps.

Edited by JoeJ

##### Share on other sites
Posted (edited)

You have more than 5 texture loads; you also have two times four loads for the depth fetch, which you can probably replace with a gather plus a little ALU. Assuming you are bandwidth bound, as there is not much ALU in your shader: what are the image formats, are they all float4 or do you have 8888 and other smaller footprints? What happens if you feed a fake 1x1 image (one that does not trigger the early zeroing)?

Edited by galop1n

##### Share on other sites

you can probably replace them with a gather with little ALU.

Thanks galop1n. I actually had a look at gather; however, it requires normalized coordinates [0,1] rather than [0, resolution]. So in my code it means extra uv computation (which may not be a good trade, though...). But shouldn't a gather and 4 Loads (of the 2x2 neighborhood) have the same mem workload? Or does gather use special hardware that is more efficient than Load?

In my case I have the option to switch to a thinner image format, but that means losing precision (and since I am doing computation with it, that may matter). I will try the fake 1x1 image to see how that affects performance.

Thanks

On GCN this can get extremely slow if the various buf_uavDataX buffers have an exact offset of a power of 2 in memory, which is likely to happen for 512*512 images.

Thanks JoeJ, but could you elaborate on that a little more?  Are you talking about something related to bank conflicts?

##### Share on other sites

Yes, gather only works with normalized coordinates in HLSL, but if you are memory bound, gather + ALU usually outperforms 4 loads with offsets. Once you have the result of the 1x1 test, you should be able to pin down the bottleneck (or at least discard that culprit).
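The uv math is cheap. Here is a rough, untested sketch of the 2x2 depth fetch as a single gather; everything that isn't in your shader is an assumption: samPoint is a point-clamp sampler you'd bind, f2InvReso comes from a cbuffer, and the depth would have to be viewed in a gather-able format such as R16_UNORM (a UINT view cannot be sampled), hence the 65535 rescale:

cbuffer cbGather : register(b1) // illustrative; set f2InvReso = 1.0 / float2(512, 424)
{
    float2 f2InvReso;
};
SamplerState samPoint : register(s0); // point-clamp sampler

float GetNormalMatchedDepthGather(Texture2D<float> tex_srvDepth, uint2 u2xy)
{
    // Sample at the shared corner of texels (x,y), (x+1,y), (x,y+1), (x+1,y+1)
    // so GatherRed returns that exact 2x2 footprint in one fetch
    float2 uv = (float2(u2xy) + 1.f) * f2InvReso;
    float4 f4D = tex_srvDepth.GatherRed(samPoint, uv);
    // Average the 4 samples, rescale the UNORM value back to millimeters,
    // then convert mm to meters and negate Z as the original code does
    float fAvg = (f4D.x + f4D.y + f4D.z + f4D.w) * 0.25f;
    return fAvg * 65535.f * -0.001f;
}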

##### Share on other sites

usually gather + alu outperform 4 loads with offsets
Is that because gather uses special HW? Or does it somehow issue fewer mem reads?

Thanks

##### Share on other sites

JoeJ, on 07 Jan 2017 - 08:37 AM, said: On GCN this can get extremely slow if the various buf_uavDataX buffers have an exact offset of a power of 2 in memory, which is likely to happen for 512*512 images. Thanks JoeJ, but could you elaborate on that a little more? Are you talking about something related to bank conflicts?

See there, starting at 'Channel Conflicts':

http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_18986

You could calculate your bandwidth requirements and compare them with the GPU specs to estimate the time the shader should take (my guess is the shader is totally memory bound).

If you are far off, such conflicts may be the reason.

Maybe you could also fix this at the time you allocate the memory for the buf_uavData buffers, by putting them all in one allocation and inserting a small offset (1 unused float4) between them to change their stride.
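In shader terms that could look something like this; a minimal sketch, where ELEM_COUNT, REGION_STRIDE, and the combined buf_uavAll are illustrative names, not the actual code:

// One combined UAV instead of seven separate ones; each logical region is
// padded by one unused float4 so the region base offsets are no longer
// exact multiples of a large power of two
#define ELEM_COUNT (512 * 424)
#define REGION_STRIDE (ELEM_COUNT + 1) // +1 float4 of padding per region

RWStructuredBuffer<float4> buf_uavAll : register(u0);

void StoreData(uint uRegion, uint uIdx, float4 f4Val)
{
    buf_uavAll[uRegion * REGION_STRIDE + uIdx] = f4Val;
}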

##### Share on other sites

Thanks JoeJ. I feel so intimidated; there are so many things to consider if you want your shader to run at full speed :(

##### Share on other sites

Yep, I wish I knew more about how this relates to graphics, where powers of two are everywhere.

There's also no public discussion, blog posts etc. about it.

I'll try to calc your bandwidth needs:

512*424*28*4 = 24 MB write

512*424*28 = 6 MB read

total 30 MB - that's nothing; did I sum up correctly?

The 680m has a bandwidth of 115.2 GB/s, so almost 2 GB per 60 Hz frame.

It can't be a bandwidth limit; register usage seems not that bad, and there's no LDS. This should be 10 times faster.

Conclusion: I'm pretty sure it's about bank or channel conflicts.

For a quick test, you can try:

buf_uavData0[uIdx] = 0.f;
buf_uavData1[uIdx+1] = 0.f;
buf_uavData2[uIdx+2] = 0.f;
buf_uavData3[uIdx+3] = 0.f;
buf_uavData4[uIdx+4] = 0.f;
buf_uavData5[uIdx+5] = 0.f;
buf_uavData6[uIdx+6] = 0.f;

and use the same pattern for the other writes and also the reads.

Does this change anything?

##### Share on other sites

I would still like to hear about the 1x1 texture read test, and a dumb question while I am at it: did you remember to divide the resolution by 8 in the dispatch call?

Also, from my understanding of bank conflicts, everything said above does not apply: a conflict is at the warp level and local to a single instruction. Because your writes use the dispatch id, every single write already interacts with all the banks, so there is no conflict.

To get a conflict, you should have a line like : uav[ dtid.x * 32] = foo;

Edited by galop1n

##### Share on other sites

To get a conflict, you should have a line like : uav[ dtid.x * 32] = foo;

My assumption is that the various uav buffers have a power-of-two offset from one another,

like &buf_uavData0[0] + 32 == &buf_uavData1[0], of course with a larger offset.

If that's the case, a conflict may happen; do you agree?

512x424 does not give a power of two, but it is a multiple of one, so is it possible?

(I assume the target buf_uavData buffers all have the same size.)

Edit: I read too quickly. What you mean is a conflict within a single parallel instruction; what I mean is

mem[index + powerOf2*1] = 0
mem[index + powerOf2*2] = 0

for successive instructions. I think both are problems, but I need to make sure I remember correctly...

Edited by JoeJ

##### Share on other sites

After looking through JoeJ's bandwidth calculation I decided to look deeper into my dispatch call (some of you had already suggested that, but since my dispatch wrapper takes thread counts and does the round-up divide internally, I didn't believe I could do such a dumb thing... it turned out I am even dumber......)

I struggled a lot over whether to post this or not, since it's so embarrassing......

Lesson learned:

do not use default params

// Wrapper declarations (the leading thread-count parameters were cut off in
// my original post; restored here, since the wrappers take thread counts and
// do the round-up divide internally):
void Dispatch2D(size_t ThreadCountX, size_t ThreadCountY,
    size_t GroupSizeX = 8, size_t GroupSizeY = 8);
void Dispatch3D(size_t ThreadCountX, size_t ThreadCountY, size_t ThreadCountZ,
    size_t GroupSizeX = 8, size_t GroupSizeY = 8, size_t GroupSizeZ = 8);


the dispatch call I made before:    Dispatch2D(512, 424, 1);  // the 1 silently binds to GroupSizeX!
the correct dispatch call:          Dispatch3D(512, 424, 1);  // the previous one was a typo.....

Also my THREAD=8 define conflicted with another definition in one of the shader's included files, so it got redefined to 128.......

After fixing all my dumbness, it now takes 0.35ms.... (0.25ms is the theoretical limit based on 115.2 GB/s)

Though this is super embarrassing, I really appreciate all you guys' help. And I will double-check everything before I post questions here in the future.

Edited by Mr_Fox

##### Share on other sites

Ha ha :)

But that's a very common mistake. To prevent it, I have each workgroup atomically increment a value in a debug buffer, and I display the value on screen whenever I bring up a new shader.

This way strange huge or tiny numbers pop out at me faster :)
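A minimal sketch of that counter, assuming a spare UAV slot (the slot, names, and group size here are illustrative):

RWByteAddressBuffer buf_uavDebug : register(u7); // cleared to 0 before dispatch

[numthreads(8, 8, 1)]
void main(uint3 DTid : SV_DispatchThreadID, uint GI : SV_GroupIndex)
{
    // One atomic add per workgroup; read the total back (or draw it as text)
    // and it should match the number of groups you meant to dispatch
    if (GI == 0)
    {
        uint uPrev;
        buf_uavDebug.InterlockedAdd(0, 1, uPrev);
    }
    // ... rest of the shader
}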

I double-checked that my perf drop was indeed caused by successive instructions and the offsets between them, NOT by offsets within a single parallel instruction.

(The writes are even arranged in an inner loop that is impossible to unroll, so they are not even tightly successive.)

But the OpenCL optimization guide I linked talks only about the parallel-instruction case that galop1n mentioned.

I have never heard of anything like this elsewhere. Maybe writes get bundled together somewhere, or (unlikely) the compiler managed to convert them into a parallel write.

So keep it in mind guys, maybe it hits you as well some time soon... :)

##### Share on other sites

Haha, I also felt embarrassed asking the dispatch parameter question; it is such a dumb mistake that it is usually caught quicker :)

Glad you solved your problem, which had nothing to do with bandwidth, bank conflicts, or any other crazy theory :)
