Mr_Fox

DX12: small static arrays in a shader cause a heavy perf drop?!


Hi,

 

I recently wrote a quick-and-dirty dynamic volume renderer in DX12 (it uses a compute shader to update the volume every frame), and I found a perf delta I can't fully understand.

 

Basically, using a static uint4 colVal[6] array instead of putting it in a constant buffer brings the frame time from 5.5ms to 21ms on my machine.

 

My guess is that the static array makes the shader register-bound, which in turn badly hurts concurrency on the GPU?

 

The following is the full HLSL file; just focus on the highlighted lines and the compute shader part. If you want, here is the link to the GitHub page of this small project:

 

https://github.com/pengliu916/VolumetricAnimation.git

 

You can download it, set the active project to VolumetricAnimation, and test it (modify VolumetricAnimation_shader.hlsl as below to compare perf):

RWStructuredBuffer<uint> g_bufVolumeUAV : register( u0 );

cbuffer cbChangesEveryFrame : register( b0 )
{
	float4x4 worldViewProj;
	float4 viewPos; 
        
	// Comment out the following two lines and uncomment the static uint4 colVal[6] block below;
	// doing so increases the frame time from 5.5ms to 21ms on my machine!!
	uint4 colVal1[6];
	uint4 bgCol1;
};

// Commenting out the uint4 colVal1[6]; and uint4 bgCol1; lines above and uncommenting this block
// increases the frame time from 5.5ms to 21ms on my machine!!
//static uint4 colVal[6] = {
//	uint4( 1, 0, 0, 0 ),
//	uint4( 0, 1, 0, 1 ),
//	uint4( 0, 0, 1, 2 ),
//	uint4( 1, 1, 0, 3 ),
//	uint4( 1, 0, 1, 4 ),
//	uint4( 0, 1, 1, 5 )
//};
//static uint4 bgCol = uint4( 64, 64, 64, 64 );

.....

//--------------------------------------------------------------------------------------
// Compute Shader
//--------------------------------------------------------------------------------------
[numthreads( 8, 8, 8 )]
void csmain( uint3 DTid : SV_DispatchThreadID, uint Tid : SV_GroupIndex )
{
	// voxelResolution is declared in the elided portion of the file above
	uint idx = DTid.x + DTid.y * voxelResolution.x + DTid.z * voxelResolution.x * voxelResolution.y;
	uint4 col = D3DX_R8G8B8A8_UINT_to_UINT4( g_bufVolumeUAV[idx] );
	col.xyz -= colVal[col.w].xyz;
	if ( !any( col.xyz - bgCol.xyz ) )
	{
		col.w = ( col.w + 1 ) % 6;
		col.xyz = 255 * colVal[col.w].xyz + bgCol.xyz; // Let it overflow, it doesn't matter
	}
	g_bufVolumeUAV[idx] = D3DX_UINT4_to_R8G8B8A8_UINT( col );
}


I believe what you're seeing is an artifact of Nvidia's more recent GPU architectures that have dedicated "constant memory" (memory for constant buffers) in a way that AMD's GCN doesn't. Because you're using a dynamic index into the array the hardware may have to go through a certain number of "replay" loops where it issues the load potentially up to 31 more times if every thread in the wave/warp used a different index.

 

This article refers to CUDA, but it's worth a read: http://devblogs.nvidia.com/parallelforall/fast-dynamic-indexing-private-arrays-cuda/

 

The question though is what their shader compiler has done such that having that array in the constant buffer yields better/different performance than the same array in what is effectively D3D's "immediate" constant buffer. The DXBC generated looks the same in the areas where it matters, so while I'm not surprised there's a performance cliff when using a dynamically indexed array like that, I am surprised it doesn't manifest in both cases.

 

Any time I dynamically index data I'll usually use a Buffer/StructuredBuffer. In theory 'tbuffer' is the better choice for dynamically indexed data, but I've never seen anyone use one, as a Buffer/StructuredBuffer seems to suffice most of the time.



 

Thanks ajmiles. It's just so hard to believe I'd see such a huge perf drop given how small the array is, and I'm fairly sure that in this program most warps have all their threads reading the same index. Also, since there are only 6 items in the array, there should be at most 6 'replay' loops (or have I got that wrong?).

 

BTW, what is the DXBC you mentioned? Also, it would be greatly appreciated if you could share any GPU profiling tools you know of that can show GPU occupancy for DX12.

 

Thanks

 

Peng


I modified the code a little more this evening and ran it on my 980 Ti at home.

 

Firstly I made some changes so Vsync was correctly and fully disabled (this is a little tricky in DX12 at the moment, it's not simply a case of setting the present interval to 0), so I could measure the frame time more accurately. I also changed the pixel shader used to draw the cube to just return white so that the cost was negligible.

 

What I find is that the frame time for the compute work + drawing a white cube is 0.9ms when using the variables from the constant buffer and 6.5ms when using the "outside the constant buffer" values. This is pretty close to a 6x slowdown (in fact slightly more!). If you set col.w to col.w % 2 (so that the only valid values for col.w are 0 and 1) then the draw time is about 1.8ms, exactly double the fastest case. So it does seem like the number of different indices within a wavefront correlates strongly with draw time on Nvidia hardware.

 

Your understanding is right though: up to 5 'extra' replay loops (in the worst case) can, it seems, make the draw 6 times longer.

 

The DXBC is the "DirectX byte code", it's the intermediate language/representation a shader gets compiled to before being passed to the GPU vendor's driver for another stage of compilation targeted at a specific GPU. You can see this output either by running 'fxc' (the shader compiler) on the command line, or using a tool like AMD's ShaderAnalyzer, part of GPU PerfStudio.

 

The word 'occupancy' has become a bit of a loaded term these days. On AMD hardware this refers to the number of wavefronts that can be scheduled for execution on a single SIMD, but I wonder if you perhaps meant GPU draw times / performance information?


Thanks so much, ajmiles, for taking the time to run those tests. I believe that perfectly explains the perf drop.

 

The occupancy I mentioned is the number of wavefronts allowed in flight per shader engine. I've read articles stating that to achieve maximum occupancy there are shader register count limits we should be aware of when writing shaders... But as you mentioned, if the array goes into 'constant memory' that's not the case.

 

I've also read about a tbuffer/cbuffer trick, but it has no effect here (or I did it wrong):

// cbuffer indicates the data lives in a constant buffer, while tbuffer indicates it lives in a texture buffer.
// Here it makes no difference:

//cbuffer cbImmutable{
//	static uint4 colVal[6] = {
//		uint4( 1, 0, 0, 0 ),
//		uint4( 0, 1, 0, 1 ),
//		uint4( 0, 0, 1, 2 ),
//		uint4( 1, 1, 0, 3 ),
//		uint4( 1, 0, 1, 4 ),
//		uint4( 0, 1, 1, 5 )
//	};
//}
tbuffer cbImmutable{
	static uint4 colVal[6] = {
		uint4( 1, 0, 0, 0 ),
		uint4( 0, 1, 0, 1 ),
		uint4( 0, 0, 1, 2 ),
		uint4( 1, 1, 0, 3 ),
		uint4( 1, 0, 1, 4 ),
		uint4( 0, 1, 1, 5 )
	};
}

It would be great if you could tell me the right way to fully disable vsync in DX12. The DX12 docs roughly say to provide more than 2 swap-chain buffers and call Present(0, 0) to disable vsync.

 

Thanks


AMD's ShaderAnalyzer for GCN is part of GPU PerfStudio and will give you occupancy for AMD hardware, but I'm not aware of any equivalent tool for Nvidia hardware. How many registers a shader uses is vendor/hardware specific, so it's not something any Microsoft-provided tool could tell you. In fact, the number of registers a shader uses could change from one driver version to the next. That said, your shader is super simple and only uses about 10 registers on GCN, so it won't be an issue.

 

I believe if you're in the Insider Program (Slow or Fast ring) and on the 10586 build then Present will do what you expect, it's just older builds that won't. I'm reluctant to put information on here that will be out of date very soon. Are you still on 10240 (RTM)?


What about making the array static const instead of just static? IIRC fxc handled that differently.

I tried that, it made no difference.


Because you're using a dynamic index into the array the hardware may have to go through a certain number of "replay" loops where it issues the load potentially up to 31 more times if every thread in the wave/warp used a different index.

 

I've run some test trials with the static array size set to 8, 16, 32, 64 and 128. The running time keeps increasing almost linearly as the array size grows... so it seems the number of 'replays' doesn't stop at 31... which is very confusing.

 

BTW, for those who want to pass macros into D3DCompileFromFile in DX12, the MSDN doc is misleading:

// Passing in the macro array as follows (the MSDN way) will fail
D3D_SHADER_MACRO macro[] = 
{ 
    {"__hlsl",   "1"},  
};

// You have to add a null entry as the last item, like this:
D3D_SHADER_MACRO macro[] = 
{ 
    {"__hlsl",   "1"}, 
    {nullptr,    nullptr} 
};
Edited by Mr_Fox


 

What about making the array static const instead of just static? IIRC fxc handled that differently.

I tried that, it made no difference.

 

I tried that just now: in release mode, adding const makes the static array version run as fast as the constant buffer version, while in debug mode adding const makes no difference. So presumably an HLSL compiler optimization kicked in.
