
DX11 [D3D12] Total brightness problem in compute shader


The code below is supposed to calculate the total brightness of the tsrc texture.

 

The good thing is HD Graphics 4600 calculates it just fine. The bad thing is GTX 980 does not.

 

The values I read from the read-back buffer fluctuate wildly, but they seem to stay below the correct value.

 

I took the code for atomic addition of float values from this thread http://www.gamedev.net/topic/613648-dx11-interlockedadd-on-floats-in-pixel-shader-workaround/

 

I have no idea what's going on. Thanks in advance.

 

EDIT: 'globallycoherent' doesn't help. Using 'InterlockedAdd' and summing uints doesn't work either.

#define TotalGroups 32
#define RSDT "RootFlags(0), UAV(u0), DescriptorTable(SRV(t0))" // Descriptor table is required for texture

Texture2D<float4> tsrc: register(t0);
RWByteAddressBuffer total : register(u0);

groupshared float4 bpacked[TotalGroups*TotalGroups];

float brightness(float4 cl) {
  // TODO: replace with correct implementation
  return cl.r + cl.g + cl.b;
}

[RootSignature(RSDT)]
[numthreads(TotalGroups,TotalGroups,1)]
void CSTotal(uint3 gtid: SV_GroupThreadId, uint3 gid : SV_GroupId, uint gindex : SV_GroupIndex, uint3 dtid : SV_DispatchThreadID) {
  uint2 crd = (gid.xy * TotalGroups + gtid.xy)*2;
  float br[4];
  [unroll]
  for (uint x = 0; x < 2; ++x) {
    [unroll]
    for (uint y = 0; y < 2; ++y) {
      // Color outside of tsrc is guaranteed to be 0.
      br[y * 2 + x] = brightness(tsrc[crd+uint2(x,y)]);
    }
  }
  bpacked[gindex] = float4(br[0],br[1],br[2],br[3]);
  if (all(dtid == uint3(0, 0, 0))) {
    // set initial value of total brightness accumulator
    total.Store(0, asuint(0.0));
  }
  AllMemoryBarrierWithGroupSync();
  // bpacked array now contains brightness in each component of each value

  // reduce bpacked to single value
  [unroll]
  for (uint thres = TotalGroups*TotalGroups / 2; thres > 0; thres /= 2) {
    if (gindex < thres) {
      bpacked[gindex] += bpacked[gindex + thres];
    }
    AllMemoryBarrierWithGroupSync();
  }
  if (gindex == 0) {
    float4 cl = bpacked[0];
    float value = cl.r + cl.g + cl.b + cl.a;

    // First thread in thread group atomically adds calculated brightness to the accumulator
    uint comp, orig = total.Load(0);
    [allow_uav_condition]do
    {
      comp = orig;
      total.InterlockedCompareExchange(0, comp, asuint(asfloat(orig) + value), orig);
    } while (orig != comp);
  }
}
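For readers less familiar with the compare-exchange trick at the end of the shader, here is a minimal CPU-side sketch of the same idea in Rust: the float lives in a 32-bit integer cell, and each writer retries until its compare-exchange succeeds. The names `atomic_add_f32` and `sum_from_groups` are made up for illustration, and OS threads stand in for thread groups, so this only shows the logic and not GPU behavior.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Atomically adds `value` to the f32 whose bits are stored in `cell`.
/// Mirrors the shader loop: read the old bits, compute the new float,
/// and retry if another thread changed the cell in the meantime.
pub fn atomic_add_f32(cell: &AtomicU32, value: f32) {
    let mut orig = cell.load(Ordering::Relaxed);
    loop {
        let new_bits = (f32::from_bits(orig) + value).to_bits();
        match cell.compare_exchange_weak(orig, new_bits, Ordering::AcqRel, Ordering::Relaxed) {
            Ok(_) => break,
            // Lost the race (or failed spuriously): retry with the fresh value.
            Err(current) => orig = current,
        }
    }
}

/// Hypothetical driver: `groups` threads each add `per_group` exactly once,
/// the way one thread per thread group does in the shader.
pub fn sum_from_groups(groups: usize, per_group: f32) -> f32 {
    let total = Arc::new(AtomicU32::new(0.0f32.to_bits()));
    let handles: Vec<_> = (0..groups)
        .map(|_| {
            let total = Arc::clone(&total);
            thread::spawn(move || atomic_add_f32(&total, per_group))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    f32::from_bits(total.load(Ordering::Acquire))
}
```

Note that the accumulator is initialized before any writer starts. Float addition is not associative, so with arbitrary values the result can differ slightly from run to run depending on the order in which writers win.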

The invocation of the compute shader is written in Rust, but it should be sufficiently readable.

 

I'm sure that the Rust bindings to D3D12 are not the cause of the problem. I have worked with them for months without issues.

        let src_desc = srv_tex2d_default_slice_mip(srcdesc.Format, 0, 1);
        core.dev.create_shader_resource_view(Some(&src), Some(&src_desc), res.total_dheap.cpu_handle(0));

        clist.set_pipeline_state(&self.total_cpso);
        clist.set_compute_root_signature(&self.total_rs);
        clist.set_descriptor_heaps(&[res.total_dheap.get()]);
        clist.set_compute_root_descriptor_table(1, res.total_dheap.gpu_handle(0));
        clist.set_compute_root_unordered_access_view(0, res.rw_total.get_gpu_virtual_address());

        clist.resource_barrier(&[
          *ResourceBarrier::transition(&src,
            D3D12_RESOURCE_STATE_COMMON, D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE),
          *ResourceBarrier::transition(&res.rw_total,
            D3D12_RESOURCE_STATE_COMMON, D3D12_RESOURCE_STATE_UNORDERED_ACCESS),
        ]);

        clist.dispatch(cw / TOTAL_CHUNK_SIZE, ch / TOTAL_CHUNK_SIZE, 1);
        clist.resource_barrier(&[
          *ResourceBarrier::transition(&res.rw_total,
            D3D12_RESOURCE_STATE_UNORDERED_ACCESS, D3D12_RESOURCE_STATE_COPY_SOURCE),
        ]);
        clist.copy_resource(&res.rb_total, &res.rw_total);
        clist.resource_barrier(&[
          *ResourceBarrier::transition(&res.rw_total,
            D3D12_RESOURCE_STATE_COPY_SOURCE, D3D12_RESOURCE_STATE_COMMON),
        ]);
        try!(clist.close());

        core.compute_queue.execute_command_lists(&[clist]);

        wait_for_compute_queue(core, &res.fence, &create_event());

        let total_brightness = res.total_brightness();
        let avg_brightness = total_brightness / cw as f32 / ch as f32;
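A side note on the dispatch size: `cw / TOTAL_CHUNK_SIZE` truncates, so if the texture extent is not a multiple of the chunk size, the last partial row or column of chunks is never dispatched. The value of `TOTAL_CHUNK_SIZE` is not shown in the post. Assuming it is 64 (32 threads times 2 texels per axis), a rounding-up helper could look like the sketch below. The function name `group_count` is hypothetical.

```rust
/// Number of thread groups needed to cover `extent` texels when each group
/// covers `chunk` texels along that axis. Rounds up so that a partially
/// filled last chunk still gets a group.
pub fn group_count(extent: u32, chunk: u32) -> u32 {
    assert!(chunk > 0, "chunk size must be non-zero");
    extent / chunk + u32::from(extent % chunk != 0)
}
```

Under that assumption, a 3840x2160 texture needs 60 x 34 groups, whereas truncating division gives 60 x 33 and skips the bottom 48 rows. Rounding up is safe here only because reads outside of tsrc return 0, as the shader comment says.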


Edited by red75prime


I knew it. This part looked just too ugly.

  if (all(dtid == uint3(0, 0, 0))) {
    // set initial value of total brightness accumulator
    total.Store(0, asuint(0.0));
  }

When I replaced it with ClearUnorderedAccessViewUint... I got a program crash on the GTX 980. It seems this function is broken on NVIDIA.

 

Then I replaced it with

[RootSignature(RSDT)]
[numthreads(1,1,1)]
void CSClearTotal() {
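  // total is declared as RWByteAddressBuffer, which has no operator[]; Store writes the zero bits
  total.Store(0, asuint(0.0));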
}

And now it works.
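A plausible explanation for why the in-shader clear misbehaved (my reading, not something confirmed in the thread): AllMemoryBarrierWithGroupSync only synchronizes threads within one thread group, and nothing orders thread groups relative to each other. If other groups add their partial sums before the group containing thread (0,0,0) stores the zero, those contributions are wiped. The sketch below simulates that on the CPU. It is deliberately simplified: the clear and the add of group 0 happen back to back, and `total_with_in_shader_clear` is a hypothetical name.

```rust
/// Simulates the accumulator when group 0 both zeroes it and contributes,
/// and thread groups finish in the given `order` (a permutation of group ids).
/// `partials[g]` is the brightness summed by group `g`; `stale` is whatever
/// the buffer held before the dispatch.
pub fn total_with_in_shader_clear(partials: &[f32], order: &[usize], stale: f32) -> f32 {
    let mut total = stale;
    for &g in order {
        if g == 0 {
            // The Store(0, ...) issued by thread (0,0,0) discards everything
            // accumulated so far, including earlier groups' contributions.
            total = 0.0;
        }
        total += partials[g];
    }
    total
}
```

With partial sums 1, 2, 3, 4, the order 0, 1, 2, 3 gives the correct 10, while the order 2, 3, 0, 1 gives only 3. The result never exceeds the correct total, which matches the symptom of values fluctuating but staying below the right answer. Clearing in a separate dispatch, as done above, removes the dependency on group order.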

 

EDIT: Maybe ClearUnorderedAccessViewUint isn't broken. Maybe I don't understand what the second parameter means.

Edited by red75prime


I took the code for atomic addition of float values from this thread http://www.gamedev.net/topic/613648-dx11-interlockedadd-on-floats-in-pixel-shader-workaround/

 

Don't spin your GPU like this. Read this instead: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf

 

or search for "parallel reduction". There are actually faster ways to do it on an NVIDIA GPU that aren't exposed to DX12 :(

 

They're reducing twice the number of values in a 1920x1080 texture (about 4 million items) in 0.268 ms on something like 5-year-old hardware for a reference benchmark.

Edited by Dingleberry


 

 


 

 

I implemented parallel reduction inside a thread group. Atomic addition is performed once per thread group. And performance is not that bad: around 4 ms for a 3840x2160 R32G32B32A32_FLOAT texture.
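For reference, the in-group reduction mentioned here follows the usual halving-stride pattern (the `thres` loop in the shader). A sequential Rust sketch of that pattern, with the hypothetical name `tree_reduce`, looks like this. On the GPU every iteration of the inner loop is a separate thread and each outer step is followed by a group barrier. Here the inner loop simply runs in order.

```rust
/// Sums `values` in place with the halving-stride pattern used in the shader:
/// at each step, element i (for i < stride) absorbs element i + stride.
/// The length must be a power of two, like the 32*32 group size.
pub fn tree_reduce(values: &mut [f32]) -> f32 {
    assert!(values.len().is_power_of_two(), "length must be a power of two");
    let mut stride = values.len() / 2;
    while stride > 0 {
        for i in 0..stride {
            values[i] += values[i + stride];
        }
        // On the GPU a group-wide barrier would go here before the next step.
        stride /= 2;
    }
    values[0]
}
```

For 1024 elements this takes 10 steps, after which only one value per group needs to touch the global accumulator.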

 

Vendor-specific optimizations can be safely postponed, I think.

Edited by red75prime


You might be getting a debug error if ClearUnorderedAccessViewUint fails. I have a GTX 970 and it works fine. You need to have the buffer's UAV in a set descriptor heap, and also the UAV can't be in a shader-visible heap, iirc. That's what got me at first.

 

I'm still highly skeptical about that atomic float addition. If 4ms is good enough then great, but it seems like a pretty substantial amount of time to me.
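To put the 4 ms figure in perspective, it helps to convert it to effective bandwidth. The sketch below assumes the pass reads every texel exactly once and that R32G32B32A32_FLOAT is 16 bytes per texel. The function name `effective_bandwidth_gbps` is made up for this example.

```rust
/// Effective read bandwidth in GB/s (1 GB = 1e9 bytes) for a pass that reads
/// every texel of a `width` x `height` texture once in `millis` milliseconds.
pub fn effective_bandwidth_gbps(width: u64, height: u64, bytes_per_texel: u64, millis: f64) -> f64 {
    let bytes = (width * height * bytes_per_texel) as f64;
    bytes / (millis * 1.0e-3) / 1.0e9
}
```

For 3840x2160 at 16 bytes per texel in 4 ms this comes out at roughly 33 GB/s, about 15% of the 224 GB/s peak quoted later in the thread for the GTX 980. So under these assumptions the pass was far from bandwidth-bound at that point.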

Edited by Dingleberry


You need to have the buffer's UAV in a set descriptor heap, and also the UAV can't be in a shader-visible heap, iirc

 

Thank you. MSDN doesn't mention any of this, and the debug layer's message in case of an error is not clear at all.

 

I experimented a bit. The GPU descriptor handle can be in any descriptor heap (set or not, shader-visible or not). The debug layer doesn't complain in any case.

 

The CPU descriptor handle must not refer to a UAV in a shader-visible heap.

 

EDIT: My bad. MSDN has a comment in the community additions section, but I use the offline docs.

Edited by red75prime


D3D12 is really tricky. It took me two weeks to make the code work on 3 out of 4 GPUs I have.

 

Key insight is "A shader cannot reliably read from UAV of a resource filled by another shader, you need to use SRV to read from it".

 

Another one is "Two ways to sum float values produced by different thread groups are a) lock buffer and sum on CPU b) spin in InterlockedCompareExchange"
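Option (a) can be sketched on the CPU side as follows. This assumes, hypothetically, that each thread group writes its partial sum as one little-endian f32 into its own slot of the read-back buffer. That layout is not described in the thread, and `sum_partials` is an illustrative name.

```rust
/// Interprets a mapped read-back buffer as little-endian f32 partial sums
/// (one per thread group) and adds them on the CPU. Accumulating in f64
/// limits rounding error; trailing bytes that do not form a full f32 are
/// ignored.
pub fn sum_partials(readback: &[u8]) -> f64 {
    readback
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]) as f64)
        .sum()
}
```

Since every group owns its slot, no atomics are needed on the GPU. The cost is a larger read-back buffer and a short loop on the CPU.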


Key insight is "A shader cannot reliably read from UAV of a resource filled by another shader, you need to use SRV to read from it".

Shouldn't that be possible as long as you issue the appropriate resource transition between the two shader dispatches? The transition tells the driver that there's a data dependency between the two dispatches, so it can insert a wait-for-cache-flush command before the 2nd one, ensuring UAV coherency.



 

It doesn't work on the Microsoft Basic Render Driver and, possibly, on the HD 4600 (I have other problems with that one). I just checked it. Also, it requires one more resource barrier, because D3D12 doesn't allow transitioning from a state into the same state.

Edited by red75prime


It sounds like you should have been using D3D12_RESOURCE_UAV_BARRIER.

 

Thanks. I completely forgot about this type of barrier. Now it is clear that I have a bug somewhere in the program. The UAV barrier doesn't work for the Microsoft Basic Render Driver either.

 

I really, really hope it is not a bug in the driver, as it was (and still is) for the full-screen transition on a certain GPU under specific conditions (I reported that bug).

Edited by red75prime


A more typical reduction doesn't require atomics and should operate on a 4K framebuffer in less than 1 ms on a modern desktop GPU.
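One common atomics-free shape is a multi-pass reduction: each pass lets every group write a single partial sum to its own output slot, and passes repeat until one value is left. The Rust sketch below models that on the CPU. `reduce_pass` and `reduce_multi_pass` are hypothetical names, and on the GPU each pass would be a separate dispatch with a barrier in between.

```rust
/// One pass: every "group" sums `group_size` consecutive inputs and writes a
/// single partial, so no two groups ever write the same location.
/// `group_size` must be non-zero.
pub fn reduce_pass(input: &[f32], group_size: usize) -> Vec<f32> {
    input
        .chunks(group_size)
        .map(|chunk| chunk.iter().sum::<f32>())
        .collect()
}

/// Repeats passes until one value remains. Returns the sum and the number of
/// passes, each of which would correspond to one dispatch on the GPU.
pub fn reduce_multi_pass(input: &[f32], group_size: usize) -> (f32, usize) {
    assert!(group_size >= 2, "each pass must shrink the data");
    let mut data = input.to_vec();
    let mut passes = 0;
    while data.len() > 1 {
        data = reduce_pass(&data, group_size);
        passes += 1;
    }
    (data.first().copied().unwrap_or(0.0), passes)
}
```

With 4096 inputs per group, a 3840x2160 texture (8,294,400 values) collapses to a single value in two passes. The trade-off against the single-dispatch atomic version is the extra dispatch and an intermediate buffer.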


Dingleberry, I was able to optimize the reduction to 0.4 ms on a GTX 980 (154 GB/s of 224) and to 1.5 ms on an R7 360 (41 GB/s of 96), while still using atomics (a lot fewer of them, though).

My point was that it's hard to program with this relatively new technology when even basic assumptions don't hold (like the commented-out part of CSBufTotal in the shader below, or the one in downsampler.hlsl) and driver bugs are still a relatively high possibility.

I was able to identify the full-screen transition bug because the Microsoft samples fail too, but there are no samples for parallel reduction.

 

Shader code: https://github.com/red75prime/dxgen/blob/master/src/dxgen/scaffolding/src/reductor.hlsl

 

EDIT: I still can't find out what could be wrong with the HD 4600. The debug layer is silent, but the read-back buffers contain only zeroes.

Edited by red75prime


I solved the HD 4600 problem. This GPU silently refuses to work with RWBuffer<float>. RWStructuredBuffer<float> works just fine.

 

The Microsoft Basic Render Driver case remains to be cracked.

 

 

 


http://diaryofagraphicsprogrammer.blogspot.com/2015/01/reloaded-compute-shader-optimizations.html

 

That part (reducing a texture to a smaller texture) wasn't a big problem. The problems began when I tried to efficiently reduce a texture to a single number entirely on the GPU. But thank you anyway. It's nice to have some working code.

 

EDIT: Interesting. Either I accidentally fixed MBRD, or the problem reveals itself only under an RDP connection. No, I haven't fixed it. The brightness value is wrong.

Edited by red75prime
