# [DX11] Instancing slows down instead of speeding up

## Recommended Posts

Hai, I'm rendering 400 objects that share the same textures, indices and vertices. Rendering them normally takes 400 draw calls, and that ran at 60+ fps. I figured that rendering them with instancing would give quite a speedup. After changing things around a bit, it now renders the 400 objects with only 2 draw calls, but the fps dropped to 30.

I ran it through GPUPerfStudio and it said my fps was limited by my draw calls (when instancing), which doesn't make a lot of sense to me. How can 400 draw calls be fast and not bottleneck my code, whereas only 2 draw calls do? Isn't that what instancing is for, to reduce the number of draw calls needed?

I'm instancing by filling a cbuffer with 256 world matrices and sending it to the shader, which uses SV_InstanceID to fetch the appropriate world matrix from the cbuffer. The CPU only runs at 40% or so while the app is running, so that doesn't seem to be the bottleneck. I've also tweaked the number of instances rendered per call (10, 20, 256); all of them are severely slower than just rendering the objects normally.

So here comes the question: how can using instancing slow my app down instead of speeding it up? Am I doing something wrong here or..?

##### Share on other sites
I believe some setting is causing this (I don't know which), because I have rendered ~15k objects at around 100 fps.

##### Share on other sites
Is that with instancing or without it?

By the way, the normal way to do instancing is to have a separate stream with the instance data. Reading the data from a cbuffer is probably more costly.
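
As a rough illustration of the separate-stream approach: in D3D11 a per-instance world matrix is usually described to the input assembler as four `float4` elements in a second vertex-buffer slot, each marked `D3D11_INPUT_PER_INSTANCE_DATA` with a step rate of 1. The sketch below only mirrors that layout in portable C++ so the offsets and stride can be checked. `InstanceData`, `ElementDesc` and `makeInstanceLayout` are hypothetical stand-ins rather than D3D11 types, and the `WORLD` semantic name is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Per-instance payload: one 4x4 float world matrix.
struct InstanceData {
    float world[4][4];
};
static_assert(sizeof(InstanceData) == 64, "expected a tightly packed 4x4 float matrix");

// Hypothetical stand-in for the fields of D3D11_INPUT_ELEMENT_DESC that
// matter here; a real layout would fill in the D3D11 struct instead.
struct ElementDesc {
    const char* semanticName;   // "WORLD" is an assumed name
    unsigned    semanticIndex;  // 0..3, one per matrix row
    unsigned    inputSlot;      // vertex-buffer slot holding the instance data
    unsigned    byteOffset;     // offset of this row inside InstanceData
    bool        perInstance;    // true ~ D3D11_INPUT_PER_INSTANCE_DATA
    unsigned    stepRate;       // 1 = advance once per instance
};

// A matrix does not fit in a single input element, so it is split into
// four float4 rows, each 16 bytes further into the instance record.
inline std::array<ElementDesc, 4> makeInstanceLayout(unsigned slot) {
    std::array<ElementDesc, 4> layout{};
    for (unsigned row = 0; row < 4; ++row) {
        layout[row] = ElementDesc{
            "WORLD", row, slot,
            static_cast<unsigned>(row * sizeof(float) * 4),
            true, 1u};
    }
    // Sanity check: the four rows together cover the whole record.
    assert(layout[3].byteOffset + 4 * sizeof(float) == sizeof(InstanceData));
    return layout;
}

// Stride to use when binding the instance buffer to its slot.
constexpr std::size_t kInstanceStride = sizeof(InstanceData);
```

With a layout like this the vertex shader would receive the matrix as an ordinary input attribute instead of indexing a cbuffer with SV_InstanceID.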

##### Share on other sites
Have you tried storing the matrices in a texture buffer instead? I don't know if this causes design problems on your end, but it might be interesting to look at.

I too find it strange that you get these results; with the approach you describe, instancing should indeed decrease the workload and increase the framerate.

There are indeed some performance issues to take into account with cbuffers, but none should affect the end result that dramatically.

It might be helpful if you could provide some (pseudo-)code of your initialization and rendering procedures.

##### Share on other sites
Maybe instancing isn't supported by the GFX card so the driver is doing it in software / without hardware acceleration?

##### Share on other sites
@ET3D:
I'll have a look at doing it the 'normal' way, with a separate stream.
The NVIDIA "SkinnedInstancing" demo does it using a huge cbuffer, so I figured that was a fast way to do it.

@Xeile:
I don't think it would be much work to see what happens if I try it with a texture, but isn't writing to a texture on the GPU a lot slower than working with a cbuffer (which is meant to be written to)? I guess it would make an interesting test though.

Here's some code:

Cbuffer creation:

    D3D11_BUFFER_DESC gpuBufDesc;
    gpuBufDesc.Usage = D3D11_USAGE_DEFAULT;
    gpuBufDesc.ByteWidth = desc.Size;
    gpuBufDesc.CPUAccessFlags = 0;
    gpuBufDesc.BindFlags = D3D11_BIND_CONSTANT_BUFFER;
    gpuBufDesc.MiscFlags = 0;
    gpuBufDesc.StructureByteStride = 0;
    dev->CreateBuffer(&gpuBufDesc, nullptr, &gpuBuffer);

Updating the cbuffer with new data:

    context->UpdateSubresource(gpuBuffer, 0, nullptr, memoryBuffer->getBuffer(), memoryBuffer->getSize(), 0);

And here's the draw:

    if (instancing)
        context->DrawIndexedInstanced(mat.getMeshBuffer()->getIndexCount(), instanceCount, 0, 0, 0);
    else
        context->DrawIndexed(mat.getMeshBuffer()->getIndexCount(), 0, 0);

Some of the HLSL:

    struct InstanceStruct
    {
        matrix World : World;
    };

    cbuffer PerInstanceCB
    {
        InstanceStruct InstanceData[MAX_INSTANCE_CONSTANTS] : InstanceData;
    }

    // In the vertex shader:
    output.Pos = mul(input.Pos, InstanceData[input.IID].World);
    output.Pos = mul(output.Pos, View);
    output.Pos = mul(output.Pos, Projection);

(And yes, I know it's faster to make a ViewProjection matrix and multiply with that instead :))

I have an HD4850 with the latest Catalyst drivers, so I believe it is supported. It's possible, however, that the DX11 drivers don't quite support it properly yet. But wouldn't my CPU usage be skyrocketing then?

##### Share on other sites
Any program using DirectX with a standard game loop will use close to 100% of one core of the CPU, unless you've written extra code to change that.

The reason is that the CPU will busy-wait for the GPU if the GPU gets ahead, which gives the minimum delay between the GPU becoming ready for more data and your code getting run again.

##### Share on other sites
So first, you have pointed out something interesting in this thread: you are taking code from an NVIDIA demo, running it on an ATI card and expecting similar results. In this case, those who posted before me about the cbuffer being your issue are probably pointing you in the right direction. It's not uncommon for the companies to post demo code that runs well on their own hardware and poorly on the other guy's hardware. The NVIDIA demo doesn't imply that there aren't other, possibly even faster, ways of doing this same work. You'll have to test on both sets of hardware to see what works best on each, or else write two shaders, one for each IHV (which is a fairly normal thing to have to do if performance is important).

First, a note on transferring instance data to the GPU. There really shouldn't be any difference in data transfer speeds between a cbuffer and a texture. Both require writing data in blocks and doing DMA transfer, but there really isn't anything interesting there about getting data from the CPU to the GPU. When transferring lots of data to the GPU you might want to consider using a dynamic buffer. A dynamic buffer will give the driver more flexibility in scheduling the data transfer and in this case will also let you transfer a variable amount of data to the card depending on the number of instances you want to draw each frame. A cbuffer will force you to send the same amount of data as your HLSL declared every time, so you'll always be paying the maximum cost even if half of the data ends up being zeros.
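
To put rough numbers on the "maximum cost" point, here is a small sketch of the arithmetic. It assumes each instance carries a single 4x4 float matrix and that the whole cbuffer is re-uploaded once per instanced draw call, as the posted `UpdateSubresource` call appears to do. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kMatrixBytes = 16 * sizeof(float);  // one 4x4 float matrix

// Number of instanced draw calls needed when each call can address at most
// maxPerBatch matrices (the array size declared in the cbuffer).
inline std::size_t drawCallsNeeded(std::size_t objects, std::size_t maxPerBatch) {
    assert(maxPerBatch > 0);
    return (objects + maxPerBatch - 1) / maxPerBatch;  // ceiling division
}

// Fixed-size cbuffer: every update uploads the whole declared array,
// even when the last batch is only partly filled.
inline std::size_t cbufferBytesPerFrame(std::size_t objects, std::size_t maxPerBatch) {
    return drawCallsNeeded(objects, maxPerBatch) * maxPerBatch * kMatrixBytes;
}

// Dynamic instance buffer: only the matrices actually drawn are written.
inline std::size_t dynamicBytesPerFrame(std::size_t objects) {
    return objects * kMatrixBytes;
}
```

For the 400 objects and 256-matrix cbuffer described in this thread, that works out to 2 draw calls, 32768 bytes per frame through the cbuffer and 25600 bytes per frame through a dynamic instance buffer.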

So why might cbuffers be a problem? cbuffers have a very different cache structure from a texture. A cbuffer is optimized for in-order access of constants, while a tbuffer is optimized for random (with locality) access. It's possible that every instance starts off with a cache miss on the cbuffer, since your indexing into it may be quite different from what the driver/hardware expects for maximum performance (the driver might be preloading data based on expectation), but you'd have to code up a different solution to find out. Tbuffers may provide better cache hits and would allow partial updates, and so could perform better overall. However, I think you should also try using instanced dynamic vertex buffers for your data, since those might have the best behavior (they were designed to optimize this scenario). There are occasional reports of people finding that texture-based model attributes perform better than the input assembler, but those reports are about using textures or tbuffers for data, not cbuffers. (This might actually depend more on whether the mesh data is optimized for vertex caching.) Going further into this, the number of attributes in the vertex data also affects utilization of the vertex cache on some hardware.

Since there are so many variables here it can be hard to figure out the right slice to get the best performance so you may just have to try quite a few things. Be careful about expecting any one technique to work well on all cards -- especially between IHVs -- because this isn't often the case.

##### Share on other sites
Well, I expected the cbuffers to work well because the NVIDIA instancing demo runs really fast here, but I guess it was rendering less than I am, and it does more than just render plain meshes..

I just stumbled across this piece of info in the "A to Z of DX Performance" presentation of Nvidia, and it says this:

Instance data:

- ATI: Ideally should come from additional streams (up to 32 with DX10.1)
- NVIDIA: Ideally should come from CB indexing

So I guess 'CB indexing', which is what I am doing, is faster on NVIDIA cards than on ATI cards, but I didn't expect THIS much of a performance decrease. I'll add instancing streams to my engine for ATI and see if it works better or not (if I figure out how, anyway).
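
If an engine ends up carrying both paths, the default could be picked from the adapter's vendor ID, which DXGI reports in `DXGI_ADAPTER_DESC::VendorId`. A minimal sketch follows, using the slide's recommendation as the starting point. The enum and function names are hypothetical, and sending unknown vendors down the stream path is an assumption, not something the slide states.

```cpp
#include <cstdint>

enum class InstancePath {
    ConstantBufferIndexing,  // matrices in a cbuffer, indexed by SV_InstanceID
    InstanceStream           // matrices in a per-instance vertex stream
};

// PCI vendor IDs, as they appear in DXGI_ADAPTER_DESC::VendorId.
constexpr std::uint32_t kVendorNvidia = 0x10DE;
constexpr std::uint32_t kVendorAti    = 0x1002;

// Default choice following the slide quoted above. ATI and any unknown
// vendor fall back to the conventional stream path (an assumption).
// Treat the result as a starting point and confirm it with measurements.
constexpr InstancePath defaultInstancePath(std::uint32_t vendorId) {
    return vendorId == kVendorNvidia ? InstancePath::ConstantBufferIndexing
                                     : InstancePath::InstanceStream;
}
```

Profiling both paths on the target cards is still the safer way to decide, since results can differ between hardware generations of the same vendor.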
