Suspiciously slow compute shader... Mem write bottleneck?

18 comments, last by galop1n 7 years, 3 months ago

Hey Guys,

I'm here to borrow your insightful eyes again to help me figure out what I did wrong...

I have a compute shader which takes 5 input Texture2Ds (all 512x424), does a bunch of computation, and outputs 28 floats for each input pixel position. Here is the shader:


#include "FastICP.inl"
#include "CalibData.inl"
Texture2D<uint> tex_srvKinectDepth : register(t0);    //R16_UINT
Texture2D<uint> tex_srvTSDFDepth : register(t1);      //R16_UINT
Texture2D<float4> tex_srvKinectNormal : register(t2); //R10G10B10A2_UNORM
Texture2D<float4> tex_srvTSDFNormal : register(t3);   //R10G10B10A2_UNORM
Texture2D<float> tex_srvWeight : register(t4);        //R8_UNORM

RWStructuredBuffer<float4> buf_uavData0 : register(u0);//CxCx,CxCy,CxCz,Ctr
RWStructuredBuffer<float4> buf_uavData1 : register(u1);//CxNx,CxNy,CxNz,CyCy
RWStructuredBuffer<float4> buf_uavData2 : register(u2);//CyNx,CyNy,CyNz,CyCz
RWStructuredBuffer<float4> buf_uavData3 : register(u3);//CzNx,CzNy,CzNz,CzCz
RWStructuredBuffer<float4> buf_uavData4 : register(u4);//NxNx,NxNy,NxNz,CxPQN
RWStructuredBuffer<float4> buf_uavData5 : register(u5);//NyNy,NyNz,NzNz,CyPQN
RWStructuredBuffer<float4> buf_uavData6 : register(u6);//NxPQN,NyPQN,NzPQN,CzPQN

void AllZero(uint uIdx)
{
    buf_uavData0[uIdx] = 0.f;
    buf_uavData1[uIdx] = 0.f;
    buf_uavData2[uIdx] = 0.f;
    buf_uavData3[uIdx] = 0.f;
    buf_uavData4[uIdx] = 0.f;
    buf_uavData5[uIdx] = 0.f;
    buf_uavData6[uIdx] = 0.f;
}

float3 ReprojectPt(uint2 u2xy, float fDepth)
{
    return float3(float2(u2xy - DEPTH_C) * fDepth / DEPTH_F, fDepth);
}

float GetNormalMatchedDepth(Texture2D<uint> tex_srvDepth, uint3 DTid)
{
    uint uAccDepth = tex_srvDepth.Load(DTid);
    uAccDepth += tex_srvDepth.Load(DTid, uint2(0, 1));
    uAccDepth += tex_srvDepth.Load(DTid, uint2(1, 0));
    uAccDepth += tex_srvDepth.Load(DTid, uint2(1, 1));
    return uAccDepth * -0.001f / 4.f;
}

[numthreads(8, 8, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
    uint uIdx = DTid.x + DTid.y * u2AlignedReso.x;
    if (tex_srvWeight.Load(DTid) < 0.05f) {
        AllZero(uIdx);
        return;
    }
    float4 f4KinectNormal = tex_srvKinectNormal.Load(DTid) * 2.f - 1.f;
    // No valid normal data
    if (f4KinectNormal.w < 0.05f) {
        AllZero(uIdx);
        return;
    }
    float4 f4TSDFNormal = tex_srvTSDFNormal.Load(DTid) * 2.f - 1.f;
    // No valid normal data
    if (f4TSDFNormal.w < 0.05f) {
        AllZero(uIdx);
        return;
    }
    // Normals are too different
    if (dot(f4TSDFNormal.xyz, f4KinectNormal.xyz) < fNormalDiffThreshold) {
        AllZero(uIdx);
        return;
    }
    float fDepth = GetNormalMatchedDepth(tex_srvKinectDepth, DTid);
    // p is Kinect point, q is TSDF point, n is TSDF normal
    // c = p x n
    float3 p = ReprojectPt(DTid.xy, fDepth);
    float3 n = f4TSDFNormal.xyz;
    float3 c = cross(p, n);

    float3 cc = c.xxx * c.xyz; // Get CxCx, CxCy, CxCz
    buf_uavData0[uIdx] = float4(cc, 1.f); // last element is counter

    cc = c.yyz * c.yzz; // Get CyCy, CyCz, CzCz
    float3 cn = c.x * n; // Get CxNx, CxNy, CxNz
    buf_uavData1[uIdx] = float4(cn, cc.x);

    cn = c.y * n; // Get CyNx, CyNy, CyNz
    buf_uavData2[uIdx] = float4(cn, cc.y);

    cn = c.z * n; // Get CzNx, CzNy, CzNz
    buf_uavData3[uIdx] = float4(cn, cc.z);

    fDepth = GetNormalMatchedDepth(tex_srvTSDFDepth, DTid);
    float3 q = ReprojectPt(DTid.xy, fDepth);
    float pqn = dot(p - q, n);
    float3 cpqn = c * pqn; // Get cx(p-q)n, cy(p-q)n, cz(p-q)n

    float3 nn = n.xxx * n.xyz; // Get NxNx, NxNy, NxNz
    buf_uavData4[uIdx] = float4(nn, cpqn.x);

    nn = n.yyz * n.yzz; // Get NyNy, NyNz, NzNz
    buf_uavData5[uIdx] = float4(nn, cpqn.y);

    float3 npqn = n * pqn; // Get nx(p-q)n, ny(p-q)n, nz(p-q)n
    buf_uavData6[uIdx] = float4(npqn, cpqn.z);
    return;
}

I know this is memory intensive and expected it to be a little slow, but with a 512x424 input resolution, taking 10ms on a GTX 680M doesn't seem right. Nvidia Nsight doesn't support the 680M, so I can't get detailed perf data on where the bottleneck is (I assume it must be the memory writes, but I don't think that should cost 10ms of GPU time... or am I wrong?)

I can see that changing all the output UAV raw buffers to 64-bit typed buffers should help, but that means losing precision... So I think it's better to discuss it with you guys first before I try the typed version.
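For reference, here is roughly what I mean on the HLSL side (just a sketch; the 64-bit element format itself, e.g. DXGI_FORMAT_R16G16B16A16_FLOAT, is chosen when the UAV is created on the CPU side):

// Sketch only: the HLSL stays float4, but the buffer is typed instead of structured,
// so each element is stored as 4 x 16-bit float (8 bytes) rather than 16 bytes.
RWBuffer<float4> buf_uavData0 : register(u0);

// Later in main(), the store itself is unchanged; the conversion to half happens on write:
buf_uavData0[uIdx] = float4(cc, 1.f);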

I was also wondering whether this pass would be better done in a pixel shader, since I don't use LDS at all, and a PS could use compressed writes to render targets (correct me if I am wrong about that...), which may help with the memory writes. But my output data size may exceed the number-of-RTs limit, so I'd end up with multiple passes...
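For what it's worth, the MRT shape I have in mind would look something like this (just a sketch of the outputs, not a working port; D3D11 allows up to 8 simultaneous render targets):

// Sketch: the 7 float4 outputs expressed as MRT outputs of a full-screen pass.
struct PSOutput
{
    float4 f4Data0 : SV_Target0; // CxCx, CxCy, CxCz, counter
    float4 f4Data1 : SV_Target1; // CxNx, CxNy, CxNz, CyCy
    float4 f4Data2 : SV_Target2; // CyNx, CyNy, CyNz, CyCz
    float4 f4Data3 : SV_Target3; // CzNx, CzNy, CzNz, CzCz
    float4 f4Data4 : SV_Target4; // NxNx, NxNy, NxNz, CxPQN
    float4 f4Data5 : SV_Target5; // NyNy, NyNz, NzNz, CyPQN
    float4 f4Data6 : SV_Target6; // NxPQN, NyPQN, NzPQN, CzPQN
};

PSOutput main(float4 f4Pos : SV_Position)
{
    PSOutput output = (PSOutput)0;
    // ...the same math as in the compute shader would go here, using
    // uint2(f4Pos.xy) in place of DTid.xy...
    return output;
}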

So please let me know if you see me doing something silly in the code, or you have any suggestions.

As always, big thanks in advance


Is there a particular reason you need to go from a 2D data structure to a 1D structure?

I never like doing that from the point of view of getting a good cache access pattern on reads/writes.

I would start by removing one buffer store (buf_uavData6) in such a way that the compiler can't remove any other computation, ie:


nn = n.yyz * n.yzz; // Get NyNy, NyNz, NzNz
float3 npqn = n * pqn; // Get nx(p-q)n, ny(p-q)n, nz(p-q)n
buf_uavData5[uIdx] = float4(nn, cpqn.y) + float4(npqn, cpqn.z);
//buf_uavData6[uIdx] = float4(npqn, cpqn.z);

And see if removing just one of the stores has any effect on overall time.

The amount of bandwidth required to read/write that amount of data is ~26MB by my count (7 output buffers x 512x424 pixels x 16 bytes is about 24MB of writes, plus a couple of MB of texture reads), so it shouldn't cost anywhere near 10ms on such a GPU.

The list of things I would try are:

1) Write to a 2D data structure to see how it affects GPU time (see the sketch after this list).

2) Remove one, two, then three buffer stores, but accumulate the results into other buffers to see how much faster it gets if you write less data (and write less times).

3) Use a typed buffer to see if it really is the write bandwidth that's a problem.
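For option 1, a minimal sketch of what that might look like here (assuming R32G32B32A32_FLOAT UAV textures of the same 512x424 size):

// Sketch only: the same stores, but into 2D UAVs so the writes stay tiled
// the same way as the 8x8 thread groups.
RWTexture2D<float4> tex_uavData0 : register(u0);
// ...tex_uavData1 through tex_uavData6 declared the same way...

// In main():
tex_uavData0[DTid.xy] = float4(cc, 1.f);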

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

Is there a particular reason you need to go from a 2D data structure to a 1D structure?

Thanks Adam. Each buf_uavData will be summed down to a single value by a GPU reduction in a later pass, so to save a little boundary checking and index computation, I thought it would be better to convert to 1D in this pass.
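To give an idea of what I mean by the reduction (illustrative sketch only, not my actual code; the buffer names are made up):

#define GROUP_SIZE 64
StructuredBuffer<float4> buf_srvData : register(t0);      // one of the buf_uavDataX buffers
RWStructuredBuffer<float4> buf_uavPartial : register(u0); // one partial sum per group

groupshared float4 f4Shared[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void main(uint uGI : SV_GroupIndex, uint3 u3Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID)
{
    // Each group loads 64 elements and folds them down to one partial sum in LDS.
    f4Shared[uGI] = buf_srvData[DTid.x];
    GroupMemoryBarrierWithGroupSync();
    [unroll]
    for (uint s = GROUP_SIZE / 2; s > 0; s >>= 1)
    {
        if (uGI < s)
            f4Shared[uGI] += f4Shared[uGI + s];
        GroupMemoryBarrierWithGroupSync();
    }
    if (uGI == 0)
        buf_uavPartial[u3Gid.x] = f4Shared[0]; // repeat the pass until one value remains
}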

I would start by removing one buffer store (buf_uavData6) in such a way that the compiler can't remove any other computation, ie:

nn = n.yyz * n.yzz; // Get NyNy, NyNz, NzNz
float3 npqn = n * pqn; // Get nx(p-q)n, ny(p-q)n, nz(p-q)n
buf_uavData5[uIdx] = float4(nn, cpqn.y) + float4(npqn, cpqn.z);
//buf_uavData6[uIdx] = float4(npqn, cpqn.z);

And see if removing just one of the stores has any effect on overall time.

I've tried reducing the 7 buffer writes down to only 1, and you are right, it still takes 9ms. So it seems the 5 texture loads are causing the slowness. I will try removing some of the loads to see what happens, but in the meantime, any suggestions?

Thanks so much

The textures are tiny, so I don't imagine it's that. Are you sure you haven't done something silly like Dispatch(512, 424, 1) rather than Dispatch(512/8, 424/8, 1)?

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

The textures are tiny, so I don't imagine it's that. Are you sure you haven't done something silly like Dispatch(512, 424, 1) rather than Dispatch(512/8, 424/8, 1)?

Sorry, I didn't notice that AllZero still writes to all 7 buffers... After changing that, the runtime is almost linearly related to how many buffers I write to... So I guess besides changing all the raw buffers to 64-bit typed ones, there is nothing else I can do to make it faster?

Thanks

What if you try rolling all the conditionals into one clause and have only one call to AllZero? I'm just wondering if your data/execution is sufficiently divergent that you're hitting multiple AllZero paths in a single wave.
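Roughly something like this (just a sketch against the shader you posted, untested):

// Sketch: do all the loads up front and fold the early-outs into one branch,
// so a wave takes at most one AllZero path.
float fWeight = tex_srvWeight.Load(DTid);
float4 f4KinectNormal = tex_srvKinectNormal.Load(DTid) * 2.f - 1.f;
float4 f4TSDFNormal = tex_srvTSDFNormal.Load(DTid) * 2.f - 1.f;

bool bReject = fWeight < 0.05f
            || f4KinectNormal.w < 0.05f
            || f4TSDFNormal.w < 0.05f
            || dot(f4TSDFNormal.xyz, f4KinectNormal.xyz) < fNormalDiffThreshold;
if (bReject)
{
    AllZero(uIdx);
    return;
}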

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

What if you try rolling all the conditionals into one clause and have only one call to AllZero? I'm just wondering if your data/execution is sufficiently divergent that you're hitting multiple AllZero paths in a single wave.

Well, removing all this condition checking didn't help at all. But it brought up a question I really want to ask:

I remember I got the following tip from somewhere:

"when GPU warp begin waiting on mem, GPU switch to another warp to prevent stalling... and to ensure there always some instructions to execute, it is better not to have mem instruction crowed together...."

so that's why my code has memory accesses scattered all over the function body. But is that tip really helpful?

Another reason I have the memory accesses scattered is to avoid long-lived variables (so registers get reused more and fewer register slots are consumed). Should I keep applying these kinds of 'tips', or is the shader compiler already very good at this and I shouldn't bother?

Thanks

void AllZero(uint uIdx)
{
    buf_uavData0[uIdx] = 0.f;
    buf_uavData1[uIdx] = 0.f;
    buf_uavData2[uIdx] = 0.f;
    buf_uavData3[uIdx] = 0.f;
    buf_uavData4[uIdx] = 0.f;
    buf_uavData5[uIdx] = 0.f;
    buf_uavData6[uIdx] = 0.f;
}

On GCN this can get extremely slow if the various buf_uavDataX buffers have an exact offset of a power of 2 in memory, which is likely to happen for 512*512 images.

(It's well documented, but I forgot the proper terminology and never understood the technical reasons well enough; it's something like the memory accesses having to be serialized because they all land on the same channel, or something like that.)

Example using a single buffer to make it clear:

mem[0] = 0;
mem[256] = 0;
mem[512] = 0;
...

This may happen because I have chosen, e.g., a list size of 256 and write to multiple lists.

To fix the issue, I change my list size to 257:

mem[0] = 0;
mem[257] = 0;
mem[514] = 0;

I have had this case only once so far: the shader did 2ms of work and needed another 2ms just to write the results. After changing the list size, the writes got hidden behind the work and no longer affected execution time.

I don't know if Nvidia has this kind of problem too, or how you could easily test it, maybe by adding an extra column of pixels, so 257x256?

Somehow it makes little sense if you think of examples like multiple render targets, which are used all the time and probably have power-of-two sizes most of the time, but maybe it helps.
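If the seven outputs were packed into one buffer (they aren't in the shader above), the padding trick would look something like this (illustrative only, made-up names):

// Illustrative only: seven "lists" packed into one buffer. If the lists start at
// offsets that are exact large power-of-two multiples apart, the stores can end up
// serialized as described above; padding the stride by one element breaks that alignment.
static const uint uPixelCount = 512 * 424;
static const uint uListStride = uPixelCount + 1; // the "+ 1" is the padding trick
RWStructuredBuffer<float4> buf_uavAll : register(u0);

// In main():
buf_uavAll[0 * uListStride + uIdx] = float4(cc, 1.f);
buf_uavAll[1 * uListStride + uIdx] = float4(cn, cc.x);
// ...and so on for the remaining five outputs...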

You have more than 5 buffer loads; you also have two times four loads for the depth, which you can probably replace with a gather and a little ALU. Assuming you are bandwidth bound, since there is not much ALU in your shader: what are the image formats, are they all float4 or do you have 8888 and other smaller footprints? What happens if you feed it a fake 1x1 image (one that does not trigger the early zeroing)?

you can probably replace them with a gather with little ALU.

Thanks galop1n, I actually had a look at gather; however, it requires normalized coordinates in [0,1] rather than [0, resolution]. So in my code it means doing extra UV computation (which may not be a good trade though...). But shouldn't a gather and 4 Loads of the 2x2 neighborhood have the same memory workload? Or does gather use special hardware that is more efficient than Load?
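For reference, the gather version I'm picturing would be something like this (untested sketch; samPoint is a point sampler I would have to add, and it assumes Gather is supported for R16_UINT on this hardware):

SamplerState samPoint : register(s0); // point-clamp sampler (not in the current shader)

float GetNormalMatchedDepthGather(Texture2D<uint> tex_srvDepth, uint2 u2xy)
{
    float2 f2Reso;
    tex_srvDepth.GetDimensions(f2Reso.x, f2Reso.y);
    // Aim the UV at the corner shared by texels (x,y), (x+1,y), (x,y+1), (x+1,y+1)
    float2 f2uv = (float2(u2xy) + 1.f) / f2Reso;
    uint4 u4Depth = tex_srvDepth.GatherRed(samPoint, f2uv);
    uint uAccDepth = u4Depth.x + u4Depth.y + u4Depth.z + u4Depth.w;
    return uAccDepth * -0.001f / 4.f;
}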

In my case I have the option to switch to a thinner image format, but that means losing precision (and since I am doing computation with it, that may matter). But I will try using a fake 1x1 image to see how that affects performance.

Thanks


On GCN this can get extremely slow if the various buf_uavDataX buffers have an exact offset of a power of 2 in memory, which is likely to happen for 512*512 images.

Thanks JoeJ, but could you elaborate on that a little more? Are you talking about something related to bank conflicts?

This topic is closed to new replies.
