_void_

DX12 Looks like shader compiler optimizes away code for some reason


Hello guys,

I am working on implementing a deferred texturing technique.

I have a screen-space material ID texture produced by the G-Buffer render pass, and I would like to compute, for each mesh type, a screen-space rectangle encompassing all pixels whose material ID belongs to that mesh type. By mesh type I mean one pipeline state object permutation used in the G-Buffer shading pass. These screen-space rectangles are later used to shade the G-Buffer per mesh type, as described in the Deferred+ article from GPU Zen.
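For reference, the reduction I want the compute shader to perform is equivalent to this CPU sketch (illustrative C++ with made-up names, not the actual renderer code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Rect { uint32_t minX, minY, maxX, maxY; };

// Computes one screen-space bounding rectangle per mesh type by scanning the
// material ID texture and mapping each material ID to its mesh type.
std::vector<Rect> CalcShadingRectangles(
    const std::vector<uint32_t>& materialIDs,           // screenWidth * screenHeight texels
    const std::vector<uint32_t>& meshTypePerMaterialID,
    uint32_t screenWidth, uint32_t screenHeight, uint32_t numMeshTypes)
{
    // Min points start at UINT_MAX, max points at 0, same as the shader init.
    std::vector<Rect> rects(numMeshTypes, Rect{0xffffffffu, 0xffffffffu, 0u, 0u});
    for (uint32_t y = 0; y < screenHeight; ++y)
    {
        for (uint32_t x = 0; x < screenWidth; ++x)
        {
            uint32_t materialID = materialIDs[y * screenWidth + x];
            uint32_t meshType = meshTypePerMaterialID[materialID];
            Rect& r = rects[meshType];
            r.minX = std::min(r.minX, x);
            r.minY = std::min(r.minY, y);
            r.maxX = std::max(r.maxX, x);
            r.maxY = std::max(r.maxY, y);
        }
    }
    return rects;
}
```

The compute shader does the same thing, except the reduction is split into a groupshared stage (atomics within a thread group) followed by a global stage (atomics on the UAVs).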

My compute shader pass for calculating the encompassing rectangles does not produce the expected results.

I did some debugging with PIX, and for some reason PIX does not show g_MaterialIDTexture and g_MeshTypePerMaterialIDBuffer in the list of bound resources.

When I step through the shader code with the debugger, reading g_MaterialIDTexture and g_MeshTypePerMaterialIDBuffer is skipped. You can see the shader below.

groupshared uint2 g_ScreenMinPoints[NUM_MESH_TYPES];
groupshared uint2 g_ScreenMaxPoints[NUM_MESH_TYPES];

#define NUM_THREADS_PER_GROUP (NUM_THREADS_X * NUM_THREADS_Y)

cbuffer AppDataBuffer : register(b0)
{
	AppData g_AppData;
}

RWStructuredBuffer<uint2> g_ShadingRectangleMinPointBuffer : register(u0);
RWStructuredBuffer<uint2> g_ShadingRectangleMaxPointBuffer : register(u1);

Texture2D<uint> g_MaterialIDTexture : register(t0);
Buffer<uint> g_MeshTypePerMaterialIDBuffer : register(t1);

[numthreads(NUM_THREADS_X, NUM_THREADS_Y, 1)]
void Main(uint3 globalThreadId : SV_DispatchThreadID, uint localThreadIndex : SV_GroupIndex)
{
	for (uint index = localThreadIndex; index < NUM_MESH_TYPES; index += NUM_THREADS_PER_GROUP)
	{
		g_ScreenMinPoints[index] = uint2(0xffffffff, 0xffffffff);
		g_ScreenMaxPoints[index] = uint2(0, 0);
	}
	GroupMemoryBarrierWithGroupSync();

	if ((globalThreadId.x < g_AppData.screenSize.x) && (globalThreadId.y < g_AppData.screenSize.y))
	{
		uint materialID = g_MaterialIDTexture[globalThreadId.xy];
		uint meshType = g_MeshTypePerMaterialIDBuffer[materialID];
				
		InterlockedMin(g_ScreenMinPoints[meshType].x, globalThreadId.x);
		InterlockedMin(g_ScreenMinPoints[meshType].y, globalThreadId.y);

		InterlockedMax(g_ScreenMaxPoints[meshType].x, globalThreadId.x);
		InterlockedMax(g_ScreenMaxPoints[meshType].y, globalThreadId.y);
	}
	GroupMemoryBarrierWithGroupSync();

	for (uint index = localThreadIndex; index < NUM_MESH_TYPES; index += NUM_THREADS_PER_GROUP)
	{
		InterlockedMin(g_ShadingRectangleMinPointBuffer[index].x, g_ScreenMinPoints[index].x);
		InterlockedMin(g_ShadingRectangleMinPointBuffer[index].y, g_ScreenMinPoints[index].y);

		InterlockedMax(g_ShadingRectangleMaxPointBuffer[index].x, g_ScreenMaxPoints[index].x);
		InterlockedMax(g_ShadingRectangleMaxPointBuffer[index].y, g_ScreenMaxPoints[index].y);
	}
}

I checked the DXBC output and it does not include them either.

//
// Generated by Microsoft (R) HLSL Shader Compiler 10.1
//
//
// Buffer Definitions: 
//
// cbuffer AppDataBuffer
// {
//
//   struct AppData
//   {
//       
//       float4x4 viewMatrix;           // Offset:    0
//       float4x4 viewInvMatrix;        // Offset:   64
//       float4x4 projMatrix;           // Offset:  128
//       float4x4 projInvMatrix;        // Offset:  192
//       float4x4 viewProjMatrix;       // Offset:  256
//       float4x4 viewProjInvMatrix;    // Offset:  320
//       float4x4 prevViewProjMatrix;   // Offset:  384
//       float4x4 prevViewProjInvMatrix;// Offset:  448
//       float4x4 notUsed1;             // Offset:  512
//       float4 cameraWorldSpacePos;    // Offset:  576
//       float4 cameraWorldFrustumPlanes[6];// Offset:  592
//       float cameraNearPlane;         // Offset:  688
//       float cameraFarPlane;          // Offset:  692
//       float2 notUsed2;               // Offset:  696
//       uint2 screenSize;              // Offset:  704
//       float2 rcpScreenSize;          // Offset:  712
//       uint2 screenHalfSize;          // Offset:  720
//       float2 rcpScreenHalfSize;      // Offset:  728
//       uint2 screenQuarterSize;       // Offset:  736
//       float2 rcpScreenQuarterSize;   // Offset:  744
//       float4 sunWorldSpaceDir;       // Offset:  752
//       float4 sunLightColor;          // Offset:  768
//       float4 notUsed3[15];           // Offset:  784
//
//   } g_AppData;                       // Offset:    0 Size:  1024
//
// }
//
// Resource bind info for g_ShadingRectangleMinPointBuffer
// {
//
//   uint2 $Element;                    // Offset:    0 Size:     8
//
// }
//
// Resource bind info for g_ShadingRectangleMaxPointBuffer
// {
//
//   uint2 $Element;                    // Offset:    0 Size:     8
//
// }
//
//
// Resource Bindings:
//
// Name                                 Type  Format         Dim      HLSL Bind  Count
// ------------------------------ ---------- ------- ----------- -------------- ------
// g_ShadingRectangleMinPointBuffer        UAV  struct         r/w             u0      1 
// g_ShadingRectangleMaxPointBuffer        UAV  struct         r/w             u1      1 
// AppDataBuffer                     cbuffer      NA          NA            cb0      1 
//
//
//
// Input signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Input
//
// Output signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// no Output
      0x00000000: cs_5_0
      0x00000008: dcl_globalFlags refactoringAllowed | skipOptimization
      0x0000000C: dcl_constantbuffer CB0[45], immediateIndexed
      0x0000001C: dcl_uav_structured u0, 8
      0x0000002C: dcl_uav_structured u1, 8
      0x0000003C: dcl_input vThreadIDInGroupFlattened
      0x00000044: dcl_input vThreadID.xy
      0x0000004C: dcl_temps 2
      0x00000054: dcl_tgsm_structured g0, 8, 1
      0x00000068: dcl_tgsm_structured g1, 8, 1
      0x0000007C: dcl_thread_group 16, 16, 1
//
// Initial variable locations:
//   vThreadID.x <- globalThreadId.x; vThreadID.y <- globalThreadId.y; vThreadID.z <- globalThreadId.z; 
//   vThreadIDInGroupFlattened.x <- localThreadIndex
//
#line 22 "D:\GitHub\RenderSDK\Samples\Bin\DynamicGI\Shaders\CalcShadingRectanglesCS.hlsl"
   0  0x0000008C: mov r0.x, vThreadIDInGroupFlattened.x
   1  0x0000009C: mov r0.y, r0.x
   2  0x000000B0: loop 
   3  0x000000B4:   mov r0.z, l(1)
   4  0x000000C8:   ult r0.z, r0.y, r0.z
   5  0x000000E4:   breakc_z r0.z

#line 24
   6  0x000000F0:   store_structured g0.x, l(0), l(0), l(-1)
   7  0x00000114:   store_structured g0.x, l(0), l(4), l(-1)

#line 25
   8  0x00000138:   mov r0.zw, l(0,0,0,0)
   9  0x00000158:   store_structured g1.x, l(0), l(0), r0.z
  10  0x0000017C:   store_structured g1.x, l(0), l(4), r0.w

#line 26
  11  0x000001A0:   mov r0.z, l(256)
  12  0x000001B4:   iadd r0.y, r0.z, r0.y
  13  0x000001D0: endloop 

#line 27
  14  0x000001D4: sync_g_t

#line 29
  15  0x000001D8: ult r0.x, vThreadID.x, cb0[44].x
  16  0x000001F4: ult r0.y, vThreadID.y, cb0[44].y
  17  0x00000210: and r0.x, r0.y, r0.x
  18  0x0000022C: if_nz r0.x

#line 34
  19  0x00000238:   atomic_umin g0, l(0, 0, 0, 0), vThreadID.x

#line 35
  20  0x0000025C:   atomic_umin g0, l(0, 4, 0, 0), vThreadID.y

#line 37
  21  0x00000280:   atomic_umax g1, l(0, 0, 0, 0), vThreadID.x

#line 38
  22  0x000002A4:   atomic_umax g1, l(0, 4, 0, 0), vThreadID.y

#line 39
  23  0x000002C8: endif 

#line 40
  24  0x000002CC: sync_g_t

#line 42
  25  0x000002D0: mov r0.x, vThreadIDInGroupFlattened.x  // r0.x <- index
  26  0x000002E0: mov r1.x, r0.x  // r1.x <- index
  27  0x000002F4: loop 
  28  0x000002F8:   mov r0.y, l(1)
  29  0x0000030C:   ult r0.y, r1.x, r0.y
  30  0x00000328:   breakc_z r0.y

#line 44
  31  0x00000334:   ld_structured r0.y, l(0), l(0), g0.xxxx
  32  0x00000358:   mov r1.y, l(0)
  33  0x0000036C:   atomic_umin u0, r1.xyxx, r0.y

#line 45
  34  0x00000388:   ld_structured r0.y, l(0), l(4), g0.xxxx
  35  0x000003AC:   mov r1.z, l(4)
  36  0x000003C0:   atomic_umin u0, r1.xzxx, r0.y

#line 47
  37  0x000003DC:   ld_structured r0.y, l(0), l(0), g1.xxxx
  38  0x00000400:   atomic_umax u1, r1.xyxx, r0.y

#line 48
  39  0x0000041C:   ld_structured r0.y, l(0), l(4), g1.xxxx
  40  0x00000440:   atomic_umax u1, r1.xzxx, r0.y

#line 49
  41  0x0000045C:   mov r0.y, l(256)
  42  0x00000470:   iadd r1.x, r0.y, r1.x
  43  0x0000048C: endloop 

#line 50
  44  0x00000490: ret 
// Approximately 45 instruction slots used

It looks like the compiler optimizes them away, but I do not understand why. Any ideas? :-)

 

Thanks,


You didn't provide a fully compilable repro, so I had to guess the missing bits, but with the compiler I have here there are references to g_MaterialIDTexture and g_MeshTypePerMaterialIDBuffer in the DXBC. Can you provide something that compiles?


@ajmiles Updated shader:

struct AppData
{
	float4x4 viewMatrix;
	float4x4 viewInvMatrix;
	float4x4 projMatrix;
	float4x4 projInvMatrix;

	float4x4 viewProjMatrix;
	float4x4 viewProjInvMatrix;
	float4x4 prevViewProjMatrix;
	float4x4 prevViewProjInvMatrix;

	float4x4 notUsed1;
	float4 cameraWorldSpacePos;
	float4 cameraWorldFrustumPlanes[6];
	float cameraNearPlane;
	float cameraFarPlane;
	float2 notUsed2;
	uint2 screenSize;
	float2 rcpScreenSize;
	uint2 screenHalfSize;
	float2 rcpScreenHalfSize;
	uint2 screenQuarterSize;
	float2 rcpScreenQuarterSize;
	float4 sunWorldSpaceDir;

	float4 sunLightColor;
	float4 notUsed3[15];
};

#define NUM_MESH_TYPES 1
#define NUM_THREADS_X 16
#define NUM_THREADS_Y 16

groupshared uint2 g_ScreenMinPoints[NUM_MESH_TYPES];
groupshared uint2 g_ScreenMaxPoints[NUM_MESH_TYPES];

#define NUM_THREADS_PER_GROUP (NUM_THREADS_X * NUM_THREADS_Y)

cbuffer AppDataBuffer : register(b0)
{
	AppData g_AppData;
}

RWStructuredBuffer<uint2> g_ShadingRectangleMinPointBuffer : register(u0);
RWStructuredBuffer<uint2> g_ShadingRectangleMaxPointBuffer : register(u1);

Texture2D<uint> g_MaterialIDTexture : register(t0);
Buffer<uint> g_MeshTypePerMaterialIDBuffer : register(t1);

[numthreads(NUM_THREADS_X, NUM_THREADS_Y, 1)]
void main(uint3 globalThreadId : SV_DispatchThreadID, uint localThreadIndex : SV_GroupIndex)
{
	for (uint index = localThreadIndex; index < NUM_MESH_TYPES; index += NUM_THREADS_PER_GROUP)
	{
		g_ScreenMinPoints[index] = uint2(0xffffffff, 0xffffffff);
		g_ScreenMaxPoints[index] = uint2(0, 0);
	}
	GroupMemoryBarrierWithGroupSync();

	if ((globalThreadId.x < g_AppData.screenSize.x) && (globalThreadId.y < g_AppData.screenSize.y))
	{
		uint materialID = g_MaterialIDTexture[globalThreadId.xy];
		uint meshType = g_MeshTypePerMaterialIDBuffer[materialID];
				
		InterlockedMin(g_ScreenMinPoints[meshType].x, globalThreadId.x);
		InterlockedMin(g_ScreenMinPoints[meshType].y, globalThreadId.y);

		InterlockedMax(g_ScreenMaxPoints[meshType].x, globalThreadId.x);
		InterlockedMax(g_ScreenMaxPoints[meshType].y, globalThreadId.y);
	}
	GroupMemoryBarrierWithGroupSync();

	for (uint index = localThreadIndex; index < NUM_MESH_TYPES; index += NUM_THREADS_PER_GROUP)
	{
		InterlockedMin(g_ShadingRectangleMinPointBuffer[index].x, g_ScreenMinPoints[index].x);
		InterlockedMin(g_ShadingRectangleMinPointBuffer[index].y, g_ScreenMinPoints[index].y);

		InterlockedMax(g_ShadingRectangleMaxPointBuffer[index].x, g_ScreenMaxPoints[index].x);
		InterlockedMax(g_ShadingRectangleMaxPointBuffer[index].y, g_ScreenMaxPoints[index].y);
	}
}

 


@ajmiles Now I understand :-)

I am using MAX_INT as the mesh type for screen pixels where geometry is missing.

I was hoping that writes outside the array bounds at index MAX_INT would simply be ignored.

But in this case the compiler just optimized the code away.

I have added an explicit check against MAX_INT and everything works now. Thanks a million!

if (meshType != MAX_INT)
{
  InterlockedMin(g_ScreenMinPoints[meshType].x, globalThreadId.x);
  InterlockedMin(g_ScreenMinPoints[meshType].y, globalThreadId.y);

  InterlockedMax(g_ScreenMaxPoints[meshType].x, globalThreadId.x);
  InterlockedMax(g_ScreenMaxPoints[meshType].y, globalThreadId.y);
}
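In hindsight the deletion makes sense. A rough analogy in C++ (a hypothetical snippet, not from the renderer): if a loaded value's only use is an out-of-bounds store, that store is undefined behaviour, so the optimizer may legally assume it never happens and delete the load together with the store — which is presumably what happened to the texture and buffer reads here.

```cpp
#include <cstdint>

static uint32_t g_shared[1];  // stand-in for the groupshared array

// If materialID maps to meshType == 0xffffffff, the store below is out of
// bounds and therefore undefined behaviour; an optimizer may assume that
// case never occurs and is free to remove the load and the store entirely.
void Update(const uint32_t* meshTypePerMaterialID, uint32_t materialID, uint32_t value)
{
    uint32_t meshType = meshTypePerMaterialID[materialID];
    g_shared[meshType] = value;
}
```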

 

13 minutes ago, _void_ said:

I was hoping that writes out of the array boundaries at index MAX_INT will be ignored.

I'm afraid not. Out-of-bounds writes to groupshared memory cause the entire contents of shared memory to become undefined. You might be thinking of UAV writes, which do discard out-of-bounds writes.
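To make the distinction concrete, out-of-bounds accesses to a typed or structured UAV behave roughly like this CPU sketch of the robustness rule (illustrative only, not real API code), whereas an out-of-bounds groupshared store has no such guard:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the UAV access rule: writes past the bound resource size are
// silently dropped, and reads past it return zero.
struct UavBufferSketch
{
    std::vector<uint32_t> data;

    void Store(size_t index, uint32_t value)
    {
        if (index < data.size())            // out-of-bounds writes are discarded
            data[index] = value;
    }

    uint32_t Load(size_t index) const
    {
        return (index < data.size()) ? data[index] : 0u;  // OOB reads return 0
    }
};
```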

