_void_

DX12 Looks like shader compiler optimizes away code for some reason


Hello guys,

I am working on implementing the deferred texturing technique.

I have a screen-space material ID texture from the G-Buffer render pass, and I would like to calculate screen-space rectangles encompassing the material IDs used by the same mesh type. By mesh type I mean one pipeline state object permutation used for the G-Buffer shading pass. Those screen-space rectangles are later used to shade the G-Buffer per mesh type, as described in the article Deferred+ from GPU Zen.

My compute shader pass for calculating the encompassing rectangles does not produce the expected results.

I did some debugging with PIX, and for some reason PIX does not show g_MaterialIDTexture and g_MeshTypePerMaterialIDBuffer in the list of bound resources.

When I step through the shader code with the debugger, the reads of g_MaterialIDTexture and g_MeshTypePerMaterialIDBuffer are skipped. You can see the shader below.

groupshared uint2 g_ScreenMinPoints[NUM_MESH_TYPES];
groupshared uint2 g_ScreenMaxPoints[NUM_MESH_TYPES];

#define NUM_THREADS_PER_GROUP (NUM_THREADS_X * NUM_THREADS_Y)

cbuffer AppDataBuffer : register(b0)
{
	AppData g_AppData;
}

RWStructuredBuffer<uint2> g_ShadingRectangleMinPointBuffer : register(u0);
RWStructuredBuffer<uint2> g_ShadingRectangleMaxPointBuffer : register(u1);

Texture2D<uint> g_MaterialIDTexture : register(t0);
Buffer<uint> g_MeshTypePerMaterialIDBuffer : register(t1);

[numthreads(NUM_THREADS_X, NUM_THREADS_Y, 1)]
void Main(uint3 globalThreadId : SV_DispatchThreadID, uint localThreadIndex : SV_GroupIndex)
{
	for (uint index = localThreadIndex; index < NUM_MESH_TYPES; index += NUM_THREADS_PER_GROUP)
	{
		g_ScreenMinPoints[index] = uint2(0xffffffff, 0xffffffff);
		g_ScreenMaxPoints[index] = uint2(0, 0);
	}
	GroupMemoryBarrierWithGroupSync();

	if ((globalThreadId.x < g_AppData.screenSize.x) && (globalThreadId.y < g_AppData.screenSize.y))
	{
		uint materialID = g_MaterialIDTexture[globalThreadId.xy];
		uint meshType = g_MeshTypePerMaterialIDBuffer[materialID];
				
		InterlockedMin(g_ScreenMinPoints[meshType].x, globalThreadId.x);
		InterlockedMin(g_ScreenMinPoints[meshType].y, globalThreadId.y);

		InterlockedMax(g_ScreenMaxPoints[meshType].x, globalThreadId.x);
		InterlockedMax(g_ScreenMaxPoints[meshType].y, globalThreadId.y);
	}
	GroupMemoryBarrierWithGroupSync();

	for (uint index = localThreadIndex; index < NUM_MESH_TYPES; index += NUM_THREADS_PER_GROUP)
	{
		InterlockedMin(g_ShadingRectangleMinPointBuffer[index].x, g_ScreenMinPoints[index].x);
		InterlockedMin(g_ShadingRectangleMinPointBuffer[index].y, g_ScreenMinPoints[index].y);

		InterlockedMax(g_ShadingRectangleMaxPointBuffer[index].x, g_ScreenMaxPoints[index].x);
		InterlockedMax(g_ShadingRectangleMaxPointBuffer[index].y, g_ScreenMaxPoints[index].y);
	}
}

I checked the DXBC output and it does not include them either.

//
// Generated by Microsoft (R) HLSL Shader Compiler 10.1
//
//
// Buffer Definitions: 
//
// cbuffer AppDataBuffer
// {
//
//   struct AppData
//   {
//       
//       float4x4 viewMatrix;           // Offset:    0
//       float4x4 viewInvMatrix;        // Offset:   64
//       float4x4 projMatrix;           // Offset:  128
//       float4x4 projInvMatrix;        // Offset:  192
//       float4x4 viewProjMatrix;       // Offset:  256
//       float4x4 viewProjInvMatrix;    // Offset:  320
//       float4x4 prevViewProjMatrix;   // Offset:  384
//       float4x4 prevViewProjInvMatrix;// Offset:  448
//       float4x4 notUsed1;             // Offset:  512
//       float4 cameraWorldSpacePos;    // Offset:  576
//       float4 cameraWorldFrustumPlanes[6];// Offset:  592
//       float cameraNearPlane;         // Offset:  688
//       float cameraFarPlane;          // Offset:  692
//       float2 notUsed2;               // Offset:  696
//       uint2 screenSize;              // Offset:  704
//       float2 rcpScreenSize;          // Offset:  712
//       uint2 screenHalfSize;          // Offset:  720
//       float2 rcpScreenHalfSize;      // Offset:  728
//       uint2 screenQuarterSize;       // Offset:  736
//       float2 rcpScreenQuarterSize;   // Offset:  744
//       float4 sunWorldSpaceDir;       // Offset:  752
//       float4 sunLightColor;          // Offset:  768
//       float4 notUsed3[15];           // Offset:  784
//
//   } g_AppData;                       // Offset:    0 Size:  1024
//
// }
//
// Resource bind info for g_ShadingRectangleMinPointBuffer
// {
//
//   uint2 $Element;                    // Offset:    0 Size:     8
//
// }
//
// Resource bind info for g_ShadingRectangleMaxPointBuffer
// {
//
//   uint2 $Element;                    // Offset:    0 Size:     8
//
// }
//
//
// Resource Bindings:
//
// Name                                 Type  Format         Dim      HLSL Bind  Count
// ------------------------------ ---------- ------- ----------- -------------- ------
// g_ShadingRectangleMinPointBuffer        UAV  struct         r/w             u0      1 
// g_ShadingRectangleMaxPointBuffer        UAV  struct         r/w             u1      1 
// AppDataBuffer                     cbuffer      NA          NA            cb0      1 
//
//
//
// Input signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Input
//
// Output signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Output
      0x00000000: cs_5_0
      0x00000008: dcl_globalFlags refactoringAllowed | skipOptimization
      0x0000000C: dcl_constantbuffer CB0[45], immediateIndexed
      0x0000001C: dcl_uav_structured u0, 8
      0x0000002C: dcl_uav_structured u1, 8
      0x0000003C: dcl_input vThreadIDInGroupFlattened
      0x00000044: dcl_input vThreadID.xy
      0x0000004C: dcl_temps 2
      0x00000054: dcl_tgsm_structured g0, 8, 1
      0x00000068: dcl_tgsm_structured g1, 8, 1
      0x0000007C: dcl_thread_group 16, 16, 1
//
// Initial variable locations:
//   vThreadID.x <- globalThreadId.x; vThreadID.y <- globalThreadId.y; vThreadID.z <- globalThreadId.z; 
//   vThreadIDInGroupFlattened.x <- localThreadIndex
//
#line 22 "D:\GitHub\RenderSDK\Samples\Bin\DynamicGI\Shaders\CalcShadingRectanglesCS.hlsl"
   0  0x0000008C: mov r0.x, vThreadIDInGroupFlattened.x
   1  0x0000009C: mov r0.y, r0.x
   2  0x000000B0: loop 
   3  0x000000B4:   mov r0.z, l(1)
   4  0x000000C8:   ult r0.z, r0.y, r0.z
   5  0x000000E4:   breakc_z r0.z

#line 24
   6  0x000000F0:   store_structured g0.x, l(0), l(0), l(-1)
   7  0x00000114:   store_structured g0.x, l(0), l(4), l(-1)

#line 25
   8  0x00000138:   mov r0.zw, l(0,0,0,0)
   9  0x00000158:   store_structured g1.x, l(0), l(0), r0.z
  10  0x0000017C:   store_structured g1.x, l(0), l(4), r0.w

#line 26
  11  0x000001A0:   mov r0.z, l(256)
  12  0x000001B4:   iadd r0.y, r0.z, r0.y
  13  0x000001D0: endloop 

#line 27
  14  0x000001D4: sync_g_t

#line 29
  15  0x000001D8: ult r0.x, vThreadID.x, cb0[44].x
  16  0x000001F4: ult r0.y, vThreadID.y, cb0[44].y
  17  0x00000210: and r0.x, r0.y, r0.x
  18  0x0000022C: if_nz r0.x

#line 34
  19  0x00000238:   atomic_umin g0, l(0, 0, 0, 0), vThreadID.x

#line 35
  20  0x0000025C:   atomic_umin g0, l(0, 4, 0, 0), vThreadID.y

#line 37
  21  0x00000280:   atomic_umax g1, l(0, 0, 0, 0), vThreadID.x

#line 38
  22  0x000002A4:   atomic_umax g1, l(0, 4, 0, 0), vThreadID.y

#line 39
  23  0x000002C8: endif 

#line 40
  24  0x000002CC: sync_g_t

#line 42
  25  0x000002D0: mov r0.x, vThreadIDInGroupFlattened.x  // r0.x <- index
  26  0x000002E0: mov r1.x, r0.x  // r1.x <- index
  27  0x000002F4: loop 
  28  0x000002F8:   mov r0.y, l(1)
  29  0x0000030C:   ult r0.y, r1.x, r0.y
  30  0x00000328:   breakc_z r0.y

#line 44
  31  0x00000334:   ld_structured r0.y, l(0), l(0), g0.xxxx
  32  0x00000358:   mov r1.y, l(0)
  33  0x0000036C:   atomic_umin u0, r1.xyxx, r0.y

#line 45
  34  0x00000388:   ld_structured r0.y, l(0), l(4), g0.xxxx
  35  0x000003AC:   mov r1.z, l(4)
  36  0x000003C0:   atomic_umin u0, r1.xzxx, r0.y

#line 47
  37  0x000003DC:   ld_structured r0.y, l(0), l(0), g1.xxxx
  38  0x00000400:   atomic_umax u1, r1.xyxx, r0.y

#line 48
  39  0x0000041C:   ld_structured r0.y, l(0), l(4), g1.xxxx
  40  0x00000440:   atomic_umax u1, r1.xzxx, r0.y

#line 49
  41  0x0000045C:   mov r0.y, l(256)
  42  0x00000470:   iadd r1.x, r0.y, r1.x
  43  0x0000048C: endloop 

#line 50
  44  0x00000490: ret 
// Approximately 45 instruction slots used

It looks like the compiler optimizes them away, but I do not understand why. Any ideas? :-)

 

Thanks,


@ajmiles Here is the updated shader:

struct AppData
{
	float4x4 viewMatrix;
	float4x4 viewInvMatrix;
	float4x4 projMatrix;
	float4x4 projInvMatrix;

	float4x4 viewProjMatrix;
	float4x4 viewProjInvMatrix;
	float4x4 prevViewProjMatrix;
	float4x4 prevViewProjInvMatrix;

	float4x4 notUsed1;
	float4 cameraWorldSpacePos;
	float4 cameraWorldFrustumPlanes[6];
	float cameraNearPlane;
	float cameraFarPlane;
	float2 notUsed2;
	uint2 screenSize;
	float2 rcpScreenSize;
	uint2 screenHalfSize;
	float2 rcpScreenHalfSize;
	uint2 screenQuarterSize;
	float2 rcpScreenQuarterSize;
	float4 sunWorldSpaceDir;

	float4 sunLightColor;
	float4 notUsed3[15];
};

#define NUM_MESH_TYPES 1
#define NUM_THREADS_X 16
#define NUM_THREADS_Y 16

groupshared uint2 g_ScreenMinPoints[NUM_MESH_TYPES];
groupshared uint2 g_ScreenMaxPoints[NUM_MESH_TYPES];

#define NUM_THREADS_PER_GROUP (NUM_THREADS_X * NUM_THREADS_Y)

cbuffer AppDataBuffer : register(b0)
{
	AppData g_AppData;
}

RWStructuredBuffer<uint2> g_ShadingRectangleMinPointBuffer : register(u0);
RWStructuredBuffer<uint2> g_ShadingRectangleMaxPointBuffer : register(u1);

Texture2D<uint> g_MaterialIDTexture : register(t0);
Buffer<uint> g_MeshTypePerMaterialIDBuffer : register(t1);

[numthreads(NUM_THREADS_X, NUM_THREADS_Y, 1)]
void main(uint3 globalThreadId : SV_DispatchThreadID, uint localThreadIndex : SV_GroupIndex)
{
	for (uint index = localThreadIndex; index < NUM_MESH_TYPES; index += NUM_THREADS_PER_GROUP)
	{
		g_ScreenMinPoints[index] = uint2(0xffffffff, 0xffffffff);
		g_ScreenMaxPoints[index] = uint2(0, 0);
	}
	GroupMemoryBarrierWithGroupSync();

	if ((globalThreadId.x < g_AppData.screenSize.x) && (globalThreadId.y < g_AppData.screenSize.y))
	{
		uint materialID = g_MaterialIDTexture[globalThreadId.xy];
		uint meshType = g_MeshTypePerMaterialIDBuffer[materialID];
				
		InterlockedMin(g_ScreenMinPoints[meshType].x, globalThreadId.x);
		InterlockedMin(g_ScreenMinPoints[meshType].y, globalThreadId.y);

		InterlockedMax(g_ScreenMaxPoints[meshType].x, globalThreadId.x);
		InterlockedMax(g_ScreenMaxPoints[meshType].y, globalThreadId.y);
	}
	GroupMemoryBarrierWithGroupSync();

	for (uint index = localThreadIndex; index < NUM_MESH_TYPES; index += NUM_THREADS_PER_GROUP)
	{
		InterlockedMin(g_ShadingRectangleMinPointBuffer[index].x, g_ScreenMinPoints[index].x);
		InterlockedMin(g_ShadingRectangleMinPointBuffer[index].y, g_ScreenMinPoints[index].y);

		InterlockedMax(g_ShadingRectangleMaxPointBuffer[index].x, g_ScreenMaxPoints[index].x);
		InterlockedMax(g_ShadingRectangleMaxPointBuffer[index].y, g_ScreenMaxPoints[index].y);
	}
}

 


@ajmiles Now I understand :-)

I am using MAX_INT as the meshType for those screen pixels where there is no geometry.

I was hoping that writes outside the array bounds, at index MAX_INT, would be ignored.

But in this case the compiler just optimized the code away.

I have added an explicit check against MAX_INT and everything works now. Thanks a million!

if (meshType != MAX_INT)
{
  InterlockedMin(g_ScreenMinPoints[meshType].x, globalThreadId.x);
  InterlockedMin(g_ScreenMinPoints[meshType].y, globalThreadId.y);

  InterlockedMax(g_ScreenMaxPoints[meshType].x, globalThreadId.x);
  InterlockedMax(g_ScreenMaxPoints[meshType].y, globalThreadId.y);
}
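
For anyone mapping the DXBC listing from the first post back to the source: the listing declares no t0/t1 resources, and its atomics address g0 and g1 with literal offsets, so it behaves as if the array index were hard-coded to 0. That fits NUM_MESH_TYPES being 1, where 0 is the only in-bounds index, and it happened even though the listing carries the skipOptimization flag. The sketch below is source that would roughly match the listing. It is one reading of the disassembly, not documented compiler behavior, and the reduced ScreenSizeBuffer / g_ScreenSize constant buffer is a hypothetical stand-in for the real AppData layout.

```hlsl
// Sketch only: roughly what the unguarded shader compiled to when
// NUM_MESH_TYPES == 1, reconstructed from the DXBC listing.
#define NUM_THREADS_X 16
#define NUM_THREADS_Y 16

// Assumption: hypothetical reduced constant buffer standing in for AppData.
cbuffer ScreenSizeBuffer : register(b0)
{
	uint2 g_ScreenSize;
}

groupshared uint2 g_ScreenMinPoints[1];
groupshared uint2 g_ScreenMaxPoints[1];

RWStructuredBuffer<uint2> g_ShadingRectangleMinPointBuffer : register(u0);
RWStructuredBuffer<uint2> g_ShadingRectangleMaxPointBuffer : register(u1);

// Note: no Texture2D / Buffer declarations. Nothing below reads them,
// which is why PIX and the DXBC resource table do not list them.

[numthreads(NUM_THREADS_X, NUM_THREADS_Y, 1)]
void main(uint3 globalThreadId : SV_DispatchThreadID, uint localThreadIndex : SV_GroupIndex)
{
	// The init loop runs for exactly one thread when the array has one element.
	if (localThreadIndex == 0)
	{
		g_ScreenMinPoints[0] = uint2(0xffffffff, 0xffffffff);
		g_ScreenMaxPoints[0] = uint2(0, 0);
	}
	GroupMemoryBarrierWithGroupSync();

	if ((globalThreadId.x < g_ScreenSize.x) && (globalThreadId.y < g_ScreenSize.y))
	{
		// The meshType lookup is gone: every thread updates element 0.
		InterlockedMin(g_ScreenMinPoints[0].x, globalThreadId.x);
		InterlockedMin(g_ScreenMinPoints[0].y, globalThreadId.y);

		InterlockedMax(g_ScreenMaxPoints[0].x, globalThreadId.x);
		InterlockedMax(g_ScreenMaxPoints[0].y, globalThreadId.y);
	}
	GroupMemoryBarrierWithGroupSync();

	// Likewise, the flush loop reduces to a single thread writing element 0.
	if (localThreadIndex == 0)
	{
		InterlockedMin(g_ShadingRectangleMinPointBuffer[0].x, g_ScreenMinPoints[0].x);
		InterlockedMin(g_ShadingRectangleMinPointBuffer[0].y, g_ScreenMinPoints[0].y);

		InterlockedMax(g_ShadingRectangleMaxPointBuffer[0].x, g_ScreenMaxPoints[0].x);
		InterlockedMax(g_ShadingRectangleMaxPointBuffer[0].y, g_ScreenMaxPoints[0].y);
	}
}
```

With the explicit check in place, the branch depends on the value of meshType, so the two reads can no longer be dropped.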

 

13 minutes ago, _void_ said:

I was hoping that writes outside the array bounds, at index MAX_INT, would be ignored.

I'm afraid not. Out-of-bounds writes to group shared memory cause the entire contents of shared memory to become undefined. You might be thinking of UAV writes, where out-of-bounds writes are discarded.
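
Given that, one option is to guard the group shared atomics with a range check instead of a comparison against one sentinel value, so that no index outside the array can ever reach shared memory. The following is a minimal sketch of that idea, not the poster's final shader. It assumes meshType is unsigned, so any "no geometry" marker such as MAX_INT fails the check. ScreenSizeBuffer / g_ScreenSize is a hypothetical stand-in for the AppData constant buffer, the loop variables initIndex and flushIndex are renamed for illustration, and the output buffers are assumed to be cleared by the application before the dispatch.

```hlsl
#define NUM_MESH_TYPES 1
#define NUM_THREADS_X 16
#define NUM_THREADS_Y 16
#define NUM_THREADS_PER_GROUP (NUM_THREADS_X * NUM_THREADS_Y)

// Assumption: hypothetical reduced constant buffer standing in for AppData.
cbuffer ScreenSizeBuffer : register(b0)
{
	uint2 g_ScreenSize;
}

groupshared uint2 g_ScreenMinPoints[NUM_MESH_TYPES];
groupshared uint2 g_ScreenMaxPoints[NUM_MESH_TYPES];

// Assumption: cleared to 0xffffffff (min) and 0 (max) before the dispatch.
RWStructuredBuffer<uint2> g_ShadingRectangleMinPointBuffer : register(u0);
RWStructuredBuffer<uint2> g_ShadingRectangleMaxPointBuffer : register(u1);

Texture2D<uint> g_MaterialIDTexture : register(t0);
Buffer<uint> g_MeshTypePerMaterialIDBuffer : register(t1);

[numthreads(NUM_THREADS_X, NUM_THREADS_Y, 1)]
void main(uint3 globalThreadId : SV_DispatchThreadID, uint localThreadIndex : SV_GroupIndex)
{
	// Reset the per-group rectangles.
	for (uint initIndex = localThreadIndex; initIndex < NUM_MESH_TYPES; initIndex += NUM_THREADS_PER_GROUP)
	{
		g_ScreenMinPoints[initIndex] = uint2(0xffffffff, 0xffffffff);
		g_ScreenMaxPoints[initIndex] = uint2(0, 0);
	}
	GroupMemoryBarrierWithGroupSync();

	if ((globalThreadId.x < g_ScreenSize.x) && (globalThreadId.y < g_ScreenSize.y))
	{
		uint materialID = g_MaterialIDTexture[globalThreadId.xy];
		uint meshType = g_MeshTypePerMaterialIDBuffer[materialID];

		// Range check: rejects the "no geometry" marker and any other
		// value that would index past the end of the groupshared arrays.
		if (meshType < NUM_MESH_TYPES)
		{
			InterlockedMin(g_ScreenMinPoints[meshType].x, globalThreadId.x);
			InterlockedMin(g_ScreenMinPoints[meshType].y, globalThreadId.y);

			InterlockedMax(g_ScreenMaxPoints[meshType].x, globalThreadId.x);
			InterlockedMax(g_ScreenMaxPoints[meshType].y, globalThreadId.y);
		}
	}
	GroupMemoryBarrierWithGroupSync();

	// Merge the per-group rectangles into the global ones. A group that saw
	// no pixels of a mesh type contributes (0xffffffff, 0), which leaves the
	// global min/max unchanged.
	for (uint flushIndex = localThreadIndex; flushIndex < NUM_MESH_TYPES; flushIndex += NUM_THREADS_PER_GROUP)
	{
		InterlockedMin(g_ShadingRectangleMinPointBuffer[flushIndex].x, g_ScreenMinPoints[flushIndex].x);
		InterlockedMin(g_ShadingRectangleMinPointBuffer[flushIndex].y, g_ScreenMinPoints[flushIndex].y);

		InterlockedMax(g_ShadingRectangleMaxPointBuffer[flushIndex].x, g_ScreenMaxPoints[flushIndex].x);
		InterlockedMax(g_ShadingRectangleMaxPointBuffer[flushIndex].y, g_ScreenMaxPoints[flushIndex].y);
	}
}
```

The range check costs one comparison per pixel and keeps working if the sentinel value changes or the lookup buffer ever returns a stale mesh type.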
