DX12 Using small static arrays in a shader causes a heavy perf drop?!



Hi,

 

I recently wrote a crappy dynamic volume rendering program in DX12 (it uses a compute shader to update the volume every frame), and I found a perf delta I could not fully understand.

 

Basically, using a static uint4 colVal[6] array instead of putting that data in a constant buffer brings the frame time from 5.5ms to 21ms on my machine.

 

My guess is that using the static array causes the shader to become register bound, which leads to very bad concurrency on the GPU?

 

The following is from the full HLSL file; just focus on the colVal/bgCol declarations (originally highlighted in red) and the compute shader part. If you want, here is the link to the GitHub page of this small project:

 

https://github.com/pengliu916/VolumetricAnimation.git

 

You can download it, set the active project to VolumetricAnimation and test it (modify VolumetricAnimation_shader.hlsl as below to compare the perf).

RWStructuredBuffer<uint> g_bufVolumeUAV : register( u0 );

cbuffer cbChangesEveryFrame : register( b0 )
{
	float4x4 worldViewProj;
	float4 viewPos; 
        
	// Commenting out the following two lines and uncommenting the static uint4 colVal[6]
	// block below will increase the frame time from 5.5ms to 21ms on my machine!!
	uint4 colVal1[6];
	uint4 bgCol1;
};

// Commenting out the uint4 colVal1[6]; and uint4 bgCol1; lines above and uncommenting this block
// will increase the frame time from 5.5ms to 21ms on my machine!!
//static uint4 colVal[6] = {
//	uint4( 1, 0, 0, 0 ),
//	uint4( 0, 1, 0, 1 ),
//	uint4( 0, 0, 1, 2 ),
//	uint4( 1, 1, 0, 3 ),
//	uint4( 1, 0, 1, 4 ),
//	uint4( 0, 1, 1, 5 )
//};
//static uint4 bgCol = uint4( 64, 64, 64, 64 );

.....

//--------------------------------------------------------------------------------------
// Compute Shader
//--------------------------------------------------------------------------------------
[numthreads( 8, 8, 8 )]
void csmain( uint3 DTid: SV_DispatchThreadID, uint Tid : SV_GroupIndex )
{
	uint4 col = D3DX_R8G8B8A8_UINT_to_UINT4( g_bufVolumeUAV[DTid.x + DTid.y*voxelResolution.x + DTid.z*voxelResolution.x*voxelResolution.y] );
	col.xyz -= colVal[col.w].xyz;
	if ( !any( col.xyz - bgCol.xyz ) )
	{
		col.w = ( col.w + 1 ) % 6;
		col.xyz = 255 * colVal[col.w].xyz + bgCol.xyz; // Let it overflow, it doesn't matter
	}
	g_bufVolumeUAV[DTid.x + DTid.y*voxelResolution.x + DTid.z*voxelResolution.x*voxelResolution.y] = D3DX_UINT4_to_R8G8B8A8_UINT( col );
}


I believe what you're seeing is an artifact of Nvidia's more recent GPU architectures, which have dedicated "constant memory" (memory for constant buffers) in a way that AMD's GCN doesn't. Because you're using a dynamic index into the array, the hardware may have to go through a certain number of "replay" loops where it issues the load potentially up to 31 more times if every thread in the wave/warp used a different index.

 

This article refers to CUDA, but it's worth a read: http://devblogs.nvidia.com/parallelforall/fast-dynamic-indexing-private-arrays-cuda/

 

The question though is what their shader compiler has done such that having that array in the constant buffer yields better/different performance than that same array in what is effectively D3D's "immediate" constant buffer. The DXBC generated looks the same in the areas where it matters, so while I'm not surprised there is a performance cliff when using a dynamically indexed array like that, I am surprised it doesn't manifest in both cases.

 

Any time I dynamically index data I'll usually use a Buffer/StructuredBuffer. In theory 'tbuffer' is the better choice for dynamically indexed data, but I've never seen anyone use one as Buffer/StructuredBuffer seem to suffice most of the time.
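For this particular shader, a minimal sketch of that approach might look like the one below. The t0 register, the SRV binding and the CPU-side upload are assumptions on my part, not something taken from the project:

// Sketch: move the dynamically indexed table out of the constant buffer into a
// StructuredBuffer that the app fills once at startup and binds as an SRV.
StructuredBuffer<uint4> g_colVal : register( t0 );   // 6 entries, same data as colVal[6]

// ... inside csmain the lookup is unchanged apart from the resource name:
//     col.xyz -= g_colVal[col.w].xyz;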



 

Thanks ajmiles. It's just so hard to believe that I observed such a huge perf drop given that the array is so small. I am sure that in this program most warps have all their threads read the same index from that array, and since there are only 6 items in the array there can be at most 6 'replay' loops (or did I get that wrong?).

 

BTW, what is the DXBC you mentioned before? Also, it would be greatly appreciated if you could share any GPU profiling tools you know of that can show GPU occupancy for DX12.

 

Thanks

 

Peng


I modified the code a little more this evening and ran it on my 980 Ti at home.

 

Firstly I made some changes so Vsync was correctly and fully disabled (this is a little tricky in DX12 at the moment, it's not simply a case of setting the present interval to 0), so I could measure the frame time more accurately. I also changed the pixel shader used to draw the cube to just return white so that the cost was negligible.

 

What I find is that the frame time for the compute work + drawing a white cube is 0.9ms when using the variables from the constant buffer and 6.5ms when using the "outside the constant buffer" values. This is pretty close to being a 6x slowdown (it's in fact slightly more!). If you take col.w and set it to col.w % 2 (such that the only valid values for col.w are 0 and 1), then the draw time is about 1.8ms, exactly double what it was in the fastest case. So it does seem like the number of different indices within a wavefront has a strong correlation with draw time on Nvidia hardware.
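For reference, a sketch of that change against the compute shader posted above; the exact line that was edited isn't shown in the thread, so treat this as an assumption:

// Clamp the dynamic index to two values before any lookup, so at most
// two distinct indices can occur within a single wave/warp.
col.w = col.w % 2;
col.xyz -= colVal[col.w].xyz;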

 

Your understanding is right though; up to 5 'extra' replay loops (in the worst case) can make the draw 6 times longer, it seems.

 

DXBC is the "DirectX byte code": the intermediate language/representation a shader gets compiled to before being passed to the GPU vendor's driver for another stage of compilation targeted at a specific GPU. You can see this output either by running 'fxc' (the shader compiler) on the command line, or by using a tool like AMD's ShaderAnalyzer, part of GPU PerfStudio.
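For example, against the shader in this thread something like the command below should dump the listing; it assumes fxc.exe from the Windows SDK is on your PATH and that any project-specific defines and includes are passed as well:

fxc /T cs_5_0 /E csmain /Fc VolumetricAnimation_shader.asm VolumetricAnimation_shader.hlsl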

 

The word 'occupancy' has become a bit of a loaded term these days. On AMD hardware this refers to the number of wavefronts that can be scheduled for execution on a single SIMD, but I wonder if you perhaps meant GPU draw times / performance information?


Thanks so much, ajmiles, for taking the time to run those tests. I believe that perfectly explains the perf drop.

 

The occupancy I mentioned is the number of wavefronts allowed in flight per shader engine. I have read articles stating that, in order to achieve maximum occupancy, there are shader register count limits we should be aware of when writing the shader... But as you mentioned, if the array goes into 'constant memory' that is not the case here.

 

Also, I have read about some tricks with tbuffer/cbuffer, which have no effect here (or I did it wrong):

// cbuffer indicates the data is in a constant buffer, while tbuffer indicates the data is in a texture buffer?
// Here it makes no difference.

//cbuffer cbImmutable{
//	static uint4 colVal[6] = {
//		uint4( 1, 0, 0, 0 ),
//		uint4( 0, 1, 0, 1 ),
//		uint4( 0, 0, 1, 2 ),
//		uint4( 1, 1, 0, 3 ),
//		uint4( 1, 0, 1, 4 ),
//		uint4( 0, 1, 1, 5 )
//	};
//}
tbuffer cbImmutable{
	static uint4 colVal[6] = {
		uint4( 1, 0, 0, 0 ),
		uint4( 0, 1, 0, 1 ),
		uint4( 0, 0, 1, 2 ),
		uint4( 1, 1, 0, 3 ),
		uint4( 1, 0, 1, 4 ),
		uint4( 0, 1, 1, 5 )
	};
}
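Note that because the members above are declared static with initializers, the compiler most likely treats them as ordinary compile-time constants rather than as part of any buffer, which would explain why the cbuffer/tbuffer wrapper makes no difference. A tbuffer whose contents actually come from the application would look more like the sketch below; the t1 register and the CPU-side update are assumptions:

// Sketch: a tbuffer really backed by GPU memory. No static, no initializers;
// the app writes the six colour values and binds the buffer to t1.
tbuffer tbImmutable : register( t1 )
{
	uint4 colVal[6];
	uint4 bgCol;
};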

It would be great if you could tell me the right way to fully disable vsync in DX12. The DX12 docs roughly say to provide more than 2 swap chain buffers and use Present(0, 0) to disable vsync.

 

Thanks


AMD's ShaderAnalyzer for GCN is part of GPU PerfStudio and will give you occupancy for AMD hardware, but I'm not aware of any equivalent tool for Nvidia hardware. How many registers a shader uses is vendor/hardware specific, so it's not something any Microsoft provided tool could tell you. In fact, the number of registers a shader uses could change from one driver version to the next. That said, your shader is super simple and only uses about 10 registers on GCN, so it won't be an issue.

 

I believe if you're in the Insider Program (Slow or Fast ring) and on the 10586 build then Present will do what you expect; it's just older builds that won't. I'm reluctant to put information on here that will be out of date very soon. Are you still on 10240 (RTM)?


What about making the array static const instead of just static? IIRC fxc handled that differently.

I tried that, it made no difference.
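For reference, the variant being discussed is just the existing array with const added:

static const uint4 colVal[6] = {
	uint4( 1, 0, 0, 0 ),
	uint4( 0, 1, 0, 1 ),
	uint4( 0, 0, 1, 2 ),
	uint4( 1, 1, 0, 3 ),
	uint4( 1, 0, 1, 4 ),
	uint4( 0, 1, 1, 5 )
};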


Because you're using a dynamic index into the array the hardware may have to go through a certain number of "replay" loops where it issues the load potentially up to 31 more times if every thread in the wave/warp used a different index.

 

I have run some test trials with the static array size set to 8, 16, 32, 64 and 128. The running time keeps increasing almost linearly as the static array size grows... so it seems the number of 'replay' loops does not top out at 31... which is very confusing.

 

BTW, for those who want to pass macros into D3DCompileFromFile in DX12, the MSDN doc is misleading:

// Passing the macros like this (the MSDN way) will fail
D3D_SHADER_MACRO macro[] = 
{ 
    {"__hlsl",   "1"},  
};

// You have to add a null entry as the last item, like this
D3D_SHADER_MACRO macro[] = 
{ 
    {"__hlsl",   "1"}, 
    {nullptr,    nullptr} 
};

Share this post


Link to post
Share on other sites

 

What about making the array static const instead of just static? IIRC fxc handled that differently.

I tried that, it made no difference.

 

I tried that just now: in release mode, adding const makes the static array version run as fast as the constant buffer version, while in debug mode adding const makes no difference. So probably the HLSL compiler's optimization kicked in.
