Suspicious slow compute shader... Mem write bottleneck?


Hey Guys,

 

I'm here to borrow your insightful eyes again to help me find what I did wrong...

 

 

I have a compute shader which takes 5 input Texture2Ds (all 512x424), does a bunch of computation, and outputs 28 floats for each input pixel position. Here is the shader:

#include "FastICP.inl"
#include "CalibData.inl"
Texture2D<uint> tex_srvKinectDepth : register(t0);    //R16_UINT
Texture2D<uint> tex_srvTSDFDepth : register(t1);      //R16_UINT
Texture2D<float4> tex_srvKinectNormal : register(t2); //R10G10B10A2_UNORM
Texture2D<float4> tex_srvTSDFNormal : register(t3);   //R10G10B10A2_UNORM
Texture2D<float> tex_srvWeight : register(t4);        //R8_UNORM

RWStructuredBuffer<float4> buf_uavData0 : register(u0);//CxCx,CxCy,CxCz,Ctr
RWStructuredBuffer<float4> buf_uavData1 : register(u1);//CxNx,CxNy,CxNz,CyCy
RWStructuredBuffer<float4> buf_uavData2 : register(u2);//CyNx,CyNy,CyNz,CyCz
RWStructuredBuffer<float4> buf_uavData3 : register(u3);//CzNx,CzNy,CzNz,CzCz
RWStructuredBuffer<float4> buf_uavData4 : register(u4);//NxNx,NxNy,NxNz,CxPQN
RWStructuredBuffer<float4> buf_uavData5 : register(u5);//NyNy,NyNz,NzNz,CyPQN
RWStructuredBuffer<float4> buf_uavData6 : register(u6);//NxPQN,NyPQN,NzPQN,CzPQN

void AllZero(uint uIdx)
{
    buf_uavData0[uIdx] = 0.f;
    buf_uavData1[uIdx] = 0.f;
    buf_uavData2[uIdx] = 0.f;
    buf_uavData3[uIdx] = 0.f;
    buf_uavData4[uIdx] = 0.f;
    buf_uavData5[uIdx] = 0.f;
    buf_uavData6[uIdx] = 0.f;
}

float3 ReprojectPt(uint2 u2xy, float fDepth)
{
    return float3(float2(u2xy - DEPTH_C) * fDepth / DEPTH_F, fDepth);
}

float GetNormalMatchedDepth(Texture2D<uint> tex_srvDepth, uint3 DTid)
{
    uint uAccDepth = tex_srvDepth.Load(DTid);
    uAccDepth += tex_srvDepth.Load(DTid, uint2(0, 1));
    uAccDepth += tex_srvDepth.Load(DTid, uint2(1, 0));
    uAccDepth += tex_srvDepth.Load(DTid, uint2(1, 1));
    return uAccDepth * -0.001f / 4.f;
}

[numthreads(8, 8, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
    uint uIdx = DTid.x + DTid.y * u2AlignedReso.x;
    if (tex_srvWeight.Load(DTid) < 0.05f) {
        AllZero(uIdx);
        return;
    }
    float4 f4KinectNormal = tex_srvKinectNormal.Load(DTid) * 2.f - 1.f;
    // No valid normal data
    if (f4KinectNormal.w < 0.05f) {
        AllZero(uIdx);
        return;
    }
    float4 f4TSDFNormal = tex_srvTSDFNormal.Load(DTid) * 2.f - 1.f;
    // No valid normal data
    if (f4TSDFNormal.w < 0.05f) {
        AllZero(uIdx);
        return;
    }
    // Normals are too different
    if (dot(f4TSDFNormal.xyz, f4KinectNormal.xyz) < fNormalDiffThreshold) {
        AllZero(uIdx);
        return;
    }
    float fDepth = GetNormalMatchedDepth(tex_srvKinectDepth, DTid);
    // p is Kinect point, q is TSDF point, n is TSDF normal
    // c = p x n
    float3 p = ReprojectPt(DTid.xy, fDepth);
    float3 n = f4TSDFNormal.xyz;
    float3 c = cross(p, n);

    float3 cc = c.xxx * c.xyz; // Get CxCx, CxCy, CxCz
    buf_uavData0[uIdx] = float4(cc, 1.f); // last element is counter

    cc = c.yyz * c.yzz; // Get CyCy, CyCz, CzCz
    float3 cn = c.x * n; // Get CxNx, CxNy, CxNz
    buf_uavData1[uIdx] = float4(cn, cc.x);

    cn = c.y * n; // Get CyNx, CyNy, CyNz
    buf_uavData2[uIdx] = float4(cn, cc.y);

    cn = c.z * n; // Get CzNx, CzNy, CzNz
    buf_uavData3[uIdx] = float4(cn, cc.z);

    fDepth = GetNormalMatchedDepth(tex_srvTSDFDepth, DTid);
    float3 q = ReprojectPt(DTid.xy, fDepth);
    float pqn = dot(p - q, n);
    float3 cpqn = c * pqn; // Get cx(p-q)n, cy(p-q)n, cz(p-q)n

    float3 nn = n.xxx * n.xyz; // Get NxNx, NxNy, NxNz
    buf_uavData4[uIdx] = float4(nn, cpqn.x);

    nn = n.yyz * n.yzz; // Get NyNy, NyNz, NzNz
    buf_uavData5[uIdx] = float4(nn, cpqn.y);

    float3 npqn = n * pqn; // Get nx(p-q)n, ny(p-q)n, nz(p-q)n
    buf_uavData6[uIdx] = float4(npqn, cpqn.z);
    return;
}

I know this is memory intensive and will be a bit slow, but with a 512x424 input resolution, taking 10ms on a GTX 680m doesn't seem right. Nvidia Nsight doesn't support the 680m, so I can't get detailed perf data on where the bottleneck is (I assume it must be the memory writes, but I don't think that should cost 10ms of GPU time... or am I wrong?)

I can see that changing all the output UAV raw buffers to 64-bit typed buffers should help, but that means losing precision... So I think it's better to discuss with you guys first before I try the typed version.
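For reference, the kind of 64-bit packing I have in mind is roughly this, assuming an R32G32_UINT typed buffer and that fp16 precision would be acceptable (the buffer and function names here are made up):

// Hypothetical: pack each float4 result into 64 bits (four halfs in two uints).
// The precision drop to fp16 is the trade-off mentioned above.
RWBuffer<uint2> buf_uavPacked0 : register(u0); // viewed as DXGI_FORMAT_R32G32_UINT

uint2 PackHalf4(float4 v)
{
    return uint2(f32tof16(v.x) | (f32tof16(v.y) << 16),
                 f32tof16(v.z) | (f32tof16(v.w) << 16));
}

// e.g. instead of  buf_uavData0[uIdx] = float4(cc, 1.f);
// write            buf_uavPacked0[uIdx] = PackHalf4(float4(cc, 1.f));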

 

I was also wondering whether this pass would be better as a pixel shader, since I don't use LDS at all, and a PS could use compressed writes to RTs (correct me if I'm wrong about that...), which might help with the memory writes. But my output data size may exceed the number-of-RTs limit, so it would end up as multiple passes...
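The PS skeleton I have in mind would be roughly this (just a sketch with made-up names; the seven float4 outputs map to seven render targets):

struct PSOut
{
    float4 data0 : SV_Target0; // CxCx, CxCy, CxCz, counter
    float4 data1 : SV_Target1;
    float4 data2 : SV_Target2;
    float4 data3 : SV_Target3;
    float4 data4 : SV_Target4;
    float4 data5 : SV_Target5;
    float4 data6 : SV_Target6;
};

PSOut mainPS(float4 f4Pos : SV_Position)
{
    PSOut output = (PSOut)0;
    uint3 DTid = uint3((uint2)f4Pos.xy, 0); // same coordinate the compute version uses for Load()
    // ...same math as the compute shader, filling output.data0 .. output.data6...
    return output;
}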

 

So please let me know if you see me doing something silly in the code, or if you have any suggestions.

 

As always, big thanks in advance

Edited by Mr_Fox


Is there a particular reason you need to go from a 2D data structure to a 1D structure?

 

I never like doing that from the point of view of getting a good cache access pattern on reads/writes.

 

I would start by removing one buffer store (buf_uavData6) in such a way that the compiler can't remove any other computation, ie:

nn = n.yyz * n.yzz; // Get NyNy, NyNz, NzNz
float3 npqn = n * pqn; // Get nx(p-q)n, ny(p-q)n, nz(p-q)n

buf_uavData5[uIdx] = float4(nn, cpqn.y) + float4(npqn, cpqn.z);
//buf_uavData6[uIdx] = float4(npqn, cpqn.z);

And see if removing just one of the stores has any effect on overall time.

 

The amount of bandwidth required to read/write that data is ~26MB by my count, so it shouldn't be worth 10ms on such a GPU.

 

The things I would try are:

 

1) Write to a 2D data structure to see how it affects GPU time (see the sketch after this list).

2) Remove one, two, then three buffer stores, but accumulate the results into other buffers, to see how much faster it gets if you write less data (and do fewer writes).

3) Use a typed buffer to see if it really is the write bandwidth that's the problem.
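For (1), I mean something along these lines; just a sketch with made-up names, keeping the same 512x424 footprint:

// Hypothetical 2D equivalents of the structured buffers (one RWTexture2D per output).
RWTexture2D<float4> tex_uavData0 : register(u0);
// ...tex_uavData1 .. tex_uavData6 declared the same way on u1..u6...

[numthreads(8, 8, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
    // Same math as before, but store with the 2D coordinate instead of a flattened index.
    tex_uavData0[DTid.xy] = float4(0.f, 0.f, 0.f, 0.f);
}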

Is there a particular reason you need to go from a 2D data structure to a 1D structure?

Thanks Adam. Each buf_uavData will be summed down to a single value by a GPU reduction in a later pass, so to save a little boundary checking and index computation I thought it would be better to convert to 1D in this pass.

 

 

 

I would start by removing one buffer store (buf_uavData6) in such a way that the compiler can't remove any other computation [...] And see if removing just one of the stores has any effect on overall time.

 

I've tried cutting the 7 buffer writes down to only 1, and you are right, it still takes 9ms. So it seems the 5 texture loads are causing the slowness. I will try removing some of the loads to see what happens, but in the meantime, any suggestions?

 

Thanks so much 

Edited by Mr_Fox


The textures are tiny, so I don't imagine it's that. Are you sure you haven't done something silly like Dispatch(512, 424, 1) rather than Dispatch(512/8, 424/8, 1)?

Sorry, I didn't notice AllZero still writes to all 7 buffers... After changing that, the runtime is almost linearly related to how many buffers I write to... So I guess besides changing all the raw buffers to 64-bit typed ones, there is nothing else I can do to make it faster?

 

Thanks


What if you try rolling all the conditionals into one clause and have only one call to AllZero? I'm just wondering if your data/execution is sufficiently divergent that you're hitting multiple AllZero paths in a single wave.
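For concreteness, the merged version would look roughly like this (just a sketch reusing the names from the shader above):

float4 f4KinectNormal = tex_srvKinectNormal.Load(DTid) * 2.f - 1.f;
float4 f4TSDFNormal = tex_srvTSDFNormal.Load(DTid) * 2.f - 1.f;

// One combined rejection test, one early-out path.
bool bReject = tex_srvWeight.Load(DTid) < 0.05f ||
               f4KinectNormal.w < 0.05f ||
               f4TSDFNormal.w < 0.05f ||
               dot(f4TSDFNormal.xyz, f4KinectNormal.xyz) < fNormalDiffThreshold;

if (bReject) {
    AllZero(uIdx);
    return;
}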
 

 

Well, removing all that condition checking didn't help at all. But it brought up a question I really want to ask:

 

I remember I got the following tip from somewhere:

 

"when GPU warp begin waiting on mem, GPU switch to another warp to prevent stalling... and to ensure there always some instructions to execute, it is better not to have mem instruction crowed together...."

 

so that's why my code has memory accesses scattered all over the function body. But is that tip really helpful?

 

Another reason I have the memory accesses scattered is to avoid long-lived variables (so registers get reused and fewer register slots are consumed). Should I keep applying this kind of 'tip', or is the shader compiler already very good at this so I shouldn't bother?

 

Thanks

void AllZero(uint uIdx)
{
    buf_uavData0[uIdx] = 0.f;
    buf_uavData1[uIdx] = 0.f;
    buf_uavData2[uIdx] = 0.f;
    buf_uavData3[uIdx] = 0.f;
    buf_uavData4[uIdx] = 0.f;
    buf_uavData5[uIdx] = 0.f;
    buf_uavData6[uIdx] = 0.f;
}

 

On GCN this can get extremely slow if the various buf_uavDataX buffers have an exact offset of a power of 2 in memory, which is likely to happen for 512*512 images.

(It's well documented, but I forget the proper terminology and never fully understood the technical reasons; roughly, the memory accesses have to be serialized because they all land on the same channel, or something like that.)

 

Example using a single buffer to make it clear:

 

mem[0] = 0;
mem[256] = 0;
mem[512] = 0;
...

 

This may happen because I have chosen, say, a list size of 256 and write to multiple lists.

To fix the issue, I change my list size to 257:

 

mem[0] = 0;
mem[257] = 0;
mem[514] = 0;

 

I have had this case only once so far: the shader did 2ms of work and needed another 2ms just to write the results. After changing the list size, the writes were hidden behind the work and no longer affected execution time.

 

I don't know if Nvidia has this kind of problem too, or how you could easily test it; maybe by adding an extra column of pixels, so 257x256?

Somehow it makes little sense when you think of cases like multiple render targets, which are used all the time and probably have power-of-two sizes most of the time, but maybe it helps.

Edited by JoeJ


You have more than 5 buffer loads; you also have two sets of four loads for the depth, which you could probably replace with a gather and a little ALU. Assuming you are bandwidth bound, as there is not much ALU in your shader: what are the image formats, are they all float4 or do you have 8888 and other smaller footprints? And what happens if you put in a fake 1x1 image (one that does not trigger the early zeroing)?
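Something like this is what I mean for the depth fetch (just a sketch; it assumes the hardware supports Gather on R16_UINT, and the point sampler plus the uv math are the extra ALU):

SamplerState smpPoint : register(s0); // hypothetical point-clamp sampler

float GetNormalMatchedDepthGather(Texture2D<uint> tex_srvDepth, uint2 u2xy, float2 f2InvReso)
{
    // Place uv on the corner shared by the 2x2 block starting at u2xy, so Gather
    // returns exactly those four texels (their order does not matter for a sum).
    float2 uv = (float2(u2xy) + 1.f) * f2InvReso;
    uint4 u4Depth = tex_srvDepth.GatherRed(smpPoint, uv);
    return (u4Depth.x + u4Depth.y + u4Depth.z + u4Depth.w) * -0.001f / 4.f;
}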

Edited by galop1n


you can probably replace them with a gather with little ALU.

Thanks galop1n, I actually had a look at Gather; however, it requires normalized coordinates in [0,1] rather than [0, resolution]. So in my code it means doing extra uv computation (maybe not a big deal though...). But shouldn't a gather and 4 Loads of the 2x2 neighborhood have the same memory workload? Or does gather use special hardware that is more efficient than Load?

 

In my case I have the option to switch to a thinner image format, but that means losing precision (and since I am doing computation on it, that may matter). But I will try using a fake 1x1 image to see how that affects performance.

 

Thanks 


On GCN this can get extremely slow if the various buf_uavDataX buffers have an exact offset of a power of 2 in memory, which is likely to happen for 512*512 images.

 

Thanks JoeJ, but could you elaborate on that a little more?  Are you talking about something related to bank conflicts?


JoeJ, on 07 Jan 2017 - 08:37 AM, said: On GCN this can get extremely slow if the various buf_uavDataX buffers have an exact offset of a power of 2 in memory, which is likely to happen for 512*512 images.

Thanks JoeJ, but could you elaborate on that a little more? Are you talking about something related to bank conflicts?

 

See there, starting at 'Channel Conflicts':

http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401334_18986

 

You could calculate your bandwidth requirements and compare them with the GPU specs to estimate the time the shader should take (I'd guess the shader is totally memory bound).

If you are far off, such conflicts may be the reason.

 

Maybe you could also fix this when you allocate the memory for the buf_uavData buffers, by putting them all in one allocation and inserting a small offset (1 unused float4) between them to change their stride.
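Roughly like this, as a sketch (the names and exact padding are illustrative; the point is just that consecutive buffers no longer start a power-of-two-ish stride apart):

// One big buffer instead of seven; each logical buffer gets one unused float4 of padding.
RWStructuredBuffer<float4> buf_uavAll : register(u0);

static const uint PIXEL_COUNT   = 512 * 424;
static const uint PADDED_STRIDE = PIXEL_COUNT + 1; // +1 float4 shifts each buffer's start

void StoreResult(uint uBufIdx, uint uIdx, float4 f4Value)
{
    buf_uavAll[uBufIdx * PADDED_STRIDE + uIdx] = f4Value;
}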


Yep, I wish I knew more about how this relates to graphics, where powers of two are everywhere.

There's also no public discussion, blog posts, etc. about it.

 

Let me try to calculate your bandwidth needs:

 

512 * 424 * 28 floats * 4 bytes = ~24 MB written

 

Each thread reads about 28 bytes, and I guess you have one thread per pixel, so

 

512 * 424 * 28 bytes = ~6 MB read

 

Total ~30 MB, which is nothing; did I sum that up correctly?

 

The 680m has a bandwidth of 115.2 GB/s, so almost 2 GB per 60 Hz frame.

 

It can't be a bandwidth limit; register usage doesn't seem that bad, and there's no LDS. This should be 10 times faster.

Conclusion: I'm pretty sure it's about bank or channel conflicts.

 

For a quick test, you can try:

 

buf_uavData0[uIdx] = 0.f;
buf_uavData1[uIdx+1] = 0.f;
buf_uavData2[uIdx+2] = 0.f;
buf_uavData3[uIdx+3] = 0.f;
buf_uavData4[uIdx+4] = 0.f;
buf_uavData5[uIdx+5] = 0.f;
buf_uavData6[uIdx+6] = 0.f;

 

and use the same pattern for the other writes and also the reads.

 

Does this change anything?


I would still like to hear about the 1x1 texture read test. And a dumb question while I am at it: did you remember to divide the resolution by 8 in the dispatch call?

 

Also, from my understanding of bank conflicts, what was said above does not apply here: it happens at the warp level and is local to a single instruction. Because your writes use the dispatch id, every single write already spreads across all the banks, so there is no conflict.

 

To get a conflict, you would need a line like: uav[dtid.x * 32] = foo;

Edited by galop1n


To get a conflict, you should have a line like : uav[ dtid.x * 32] = foo;

 

My assumption is that the various uav buffers have a power-of-two offset between them,
like &buf_uavData0[0] + 32 == &buf_uavData1[0], of course with a larger offset.

 

If that's the case, conflicts may happen, do you agree?
512x424 does not give a power of two, but a multiple of one, so is it possible?
(I assume the buf_uavData buffers all have the same size.)

 

Edit: I read too quickly.

What you mean is

mem[threadID * powerOf2] = 0

for a single parallel instruction.

What I mean is

mem[index + powerOf2*1] = 0
mem[index + powerOf2*2] = 0

for successive instructions.

I think both are problems, but I need to make sure I remember correctly...

Edited by JoeJ


After looking through JoeJ's bandwidth calculation I decided to look more closely at my dispatch call (some of you already suggested that, but since my dispatch wrapper takes a thread count and does the round-up divide internally, I didn't believe I could make such a dumb mistake... it turned out I am even dumber...)

 

 

I struggled a lot over whether to post this or not, since it's so embarrassing...

 

 

Lesson learned:

 

do not use default params

void Dispatch2D(
        size_t ThreadCountX, size_t ThreadCountY,
        size_t GroupSizeX = 8, size_t GroupSizeY = 8);
void Dispatch3D(
        size_t ThreadCountX, size_t ThreadCountY, size_t ThreadCountZ,
        size_t GroupSizeX = 8, size_t GroupSizeY = 8, size_t GroupSizeZ = 8);

The dispatch call I made before:    Dispatch2D(512, 424, 1);
The correct dispatch call:          Dispatch3D(512, 424, 1);  // the previous one was a typo...

(With Dispatch2D, the trailing 1 was silently taken as GroupSizeX, so the round-up divide launched 8x more groups along X than intended.)

 

 

 

Also, my THREAD=8 define conflicted with another one in one of the shader's included files, so it got redefined to 128...

 

 

 

After fixing all my dumbness, it now takes 0.35ms... (0.25ms is the theoretical limit based on 115.2 GB/s).

 

Though this is super embarrassing, I really appreciate all you guys' help. I will double-check everything before I post questions here in the future.

Edited by Mr_Fox


Ha ha :)

 

But that's a very common mistake. To prevent it, I have each workgroup atomically increment a value in a debug buffer, and I display that value on screen when bringing up a new shader.

That way, strangely huge or tiny numbers catch my eye quickly :)
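In HLSL terms it's just something like this (the debug buffer, register slot, and readback are up to you; the names are made up):

// Hypothetical debug counter: one atomic add per workgroup, read back or drawn on screen later.
RWByteAddressBuffer buf_uavDebug : register(u7);

[numthreads(8, 8, 1)]
void main(uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID)
{
    if (all(GTid == 0))                    // one thread per workgroup
        buf_uavDebug.InterlockedAdd(0, 1); // count how many workgroups actually ran
    // ...rest of the shader...
}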

 

 

I made sure my perf drop was indeed caused by successive instructions and the offsets between them, NOT by offsets within a single parallel instruction.

(The writes are even arranged in an inner loop that is impossible to unroll, so they are not tightly successive.)

But the OpenCL optimization guide I linked only talks about the parallel-instruction case, like galop1n mentioned.

 

I have never heard of anything like this elsewhere. Maybe it's caused by writes being bundled together somewhere, or, less likely, the compiler managed to convert them into a parallel write.

So keep it in mind, guys, maybe it will hit you as well some time soon... :)

