[DX11] Instancing slows down instead of speeding up
Hai,
I'm rendering 400 objects with the same textures/indices/vertices.
Usually when rendering this you'd need 400 draw calls, and it ran at 60+ fps.
I figured when I made it render with instancing it would get quite a speedup. So after changing stuff a bit it now renders the 400 objects with only 2 draw calls, but the fps went to 30.
I ran it through GPUPerfStudio and it said my fps was limited by my draw calls (when instancing), which doesn't make a lot of sense to me. How can 400 draw calls be fast and not bottleneck my code, whereas only 2 draw calls do bottleneck it? Isn't that what instancing is for? To reduce the number of draw calls needed?
I'm instancing by filling a cbuffer with 256 world matrices and sending it to the shader, where it uses SV_InstanceID to get the appropriate world matrix from the cbuffer.
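As a sanity check on the numbers in this setup, here is a minimal C++ sketch of the batch arithmetic. The helper names are hypothetical (they are not from the poster's engine), and it assumes one 4x4 float world matrix per instance, as described above. The 4096-constant figure is the D3D11 limit on four-component constants per constant buffer.

```cpp
#include <cassert>
#include <cstddef>

// One 4x4 float world matrix per instance (assumption taken from the post).
constexpr std::size_t kBytesPerMatrix = 16 * sizeof(float);

// D3D11 allows at most 4096 four-component constants per constant buffer.
constexpr std::size_t kMaxCBufferBytes = 4096 * 4 * sizeof(float);

// Number of instanced draw calls needed when one cbuffer holds at most
// maxPerBatch world matrices (ceiling division).
std::size_t drawCallsNeeded(std::size_t objects, std::size_t maxPerBatch) {
    assert(maxPerBatch > 0);
    return (objects + maxPerBatch - 1) / maxPerBatch;
}

// Size in bytes of a cbuffer that holds `instances` world matrices.
std::size_t cbufferBytes(std::size_t instances) {
    return instances * kBytesPerMatrix;
}
```

With 400 objects and 256 matrices per cbuffer, `drawCallsNeeded(400, 256)` gives the 2 draw calls mentioned above, and `cbufferBytes(256)` stays well under `kMaxCBufferBytes`.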
The cpu only runs at 40% or so while the app is running so that doesn't seem to be the bottleneck.
I've also tweaked the amount of instances that get rendered at the same time, 10, 20, 256, all of them seem to be severely slower than just rendering them normally.
So here comes the question:
How can using instancing for this slow my app down instead of speeding it up? Am I doing something wrong here or..?
I believe some setting is doing this (I don't know which),
because I got ~15k objects to render at around 100 fps.
Is that with instancing or without it?
By the way, the normal way to do instancing is to have a separate stream with the instance data. Reading the data from a cbuffer is probably more costly.
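To illustrate what a separate instance stream means, here is a small CPU-side C++ sketch that emulates how per-vertex and per-instance data are paired during an instanced draw. It is only a model, not Direct3D code: `Float3`, `InstanceData` and `assembleInstanced` are hypothetical names, a translation stands in for the full world matrix, and the addition stands in for the vertex shader's transform.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Float3 { float x, y, z; };

// Hypothetical per-instance record. The thread uses a full world matrix;
// a translation keeps the sketch short.
struct InstanceData { Float3 offset; };

// Emulates one instanced draw: the per-vertex stream is re-read for every
// instance, while the per-instance stream advances once every `stepRate`
// instances. The addition stands in for the vertex shader's transform.
std::vector<Float3> assembleInstanced(const std::vector<Float3>& vertices,
                                      const std::vector<InstanceData>& instances,
                                      std::size_t instanceCount,
                                      std::size_t stepRate = 1) {
    assert(stepRate > 0);
    std::vector<Float3> out;
    out.reserve(vertices.size() * instanceCount);
    for (std::size_t iid = 0; iid < instanceCount; ++iid) {
        const std::size_t slot = iid / stepRate;  // which instance record to read
        assert(slot < instances.size());
        const InstanceData& inst = instances[slot];
        for (const Float3& v : vertices) {
            out.push_back({v.x + inst.offset.x,
                           v.y + inst.offset.y,
                           v.z + inst.offset.z});
        }
    }
    return out;
}
```

In real D3D11 code the same pairing would be set up through the input layout, with the instance elements marked as per-instance data, so the shader receives the matrix as a vertex input instead of indexing a cbuffer.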
Have you tried this with storing the matrices into a texture buffer instead? I don't know if this causes design problems on your end, but it might be interesting to look at.
I too find it strange that you get these results; with the method you describe, instancing should indeed decrease the workload and increase the framerate.
There are indeed some performance issues to take into account regarding cbuffers, but none should have such a dramatic effect on the end result.
It might be helpful if you could provide some (pseudo-)code of your initialization and rendering procedures.
Maybe instancing isn't supported by the GFX card so the driver is doing it in software / without hardware acceleration?
@ET3D:
I'll have a look at doing it the 'normal' way, with a separate stream.
The Nvidia "SkinnedInstancing" demo does it using a huge cbuffer, so I figured that was a fast way to do it.
@Xeile:
I don't think it would be that much work to see what happens if I try it with a texture, but isn't writing to a texture on the GPU a lot slower than working with a cbuffer (which is meant to be written to)? I guess it would make an interesting test though.
Here's some code:
Cbuffer creation:
    D3D11_BUFFER_DESC gpuBufDesc;
    gpuBufDesc.Usage = D3D11_USAGE_DEFAULT;            // updated via UpdateSubresource, not Map
    gpuBufDesc.ByteWidth = desc.Size;
    gpuBufDesc.CPUAccessFlags = 0;
    gpuBufDesc.BindFlags = D3D11_BIND_CONSTANT_BUFFER;
    gpuBufDesc.MiscFlags = 0;
    gpuBufDesc.StructureByteStride = 0;
    dev->CreateBuffer(&gpuBufDesc, nullptr, &gpuBuffer);
Updating of the cbuffer with new data:
    context->UpdateSubresource(gpuBuffer, 0, nullptr, memoryBuffer->getBuffer(), memoryBuffer->getSize(), 0);
And here's the draw:
    if (instancing)
        context->DrawIndexedInstanced(mat.getMeshBuffer()->getIndexCount(), instanceCount, 0, 0, 0);
    else
        context->DrawIndexed(mat.getMeshBuffer()->getIndexCount(), 0, 0);
Some of the HLSL:
    struct InstanceStruct
    {
        matrix World : World;
    };

    cbuffer PerInstanceCB
    {
        InstanceStruct InstanceData[MAX_INSTANCE_CONSTANTS] : InstanceData;
    }

    // In the vertex shader:
    output.Pos = mul(input.Pos, InstanceData[input.IID].World);
    output.Pos = mul(output.Pos, View);
    output.Pos = mul(output.Pos, Projection);

(And yes, I know it's faster to make a ViewProjection matrix and multiply with that instead :))
@ROBERTREAD1:
I have an HD4850 with the latest Catalyst drivers, so I believe it is supported. It may be possible, however, that the DX11 drivers don't quite support it properly yet. But wouldn't my CPU usage be skyrocketing then?
Any program using DirectX with a standard game loop will use close to 100% of one core of the CPU, unless you've written extra code to change that.
The reason is that when the CPU gets ahead, it busy-waits for the GPU, which minimizes the delay between the GPU becoming ready for more data and your code running again.
So first, you have pointed out something interesting in this thread: you are taking code from an NVIDIA demo, running it on an ATI card, and expecting similar results. In this case, those who posted before me about the cbuffer being your issue are probably pointing you in the right direction. It's not uncommon for the companies to post demo code that runs well on their own hardware and poorly on the other guy's hardware. The NVIDIA demo also doesn't imply that there aren't other, possibly even faster, ways of doing the same work. You'll have to test on both sets of hardware to see what works best across them, or else write two shaders, one for each IHV (which is a fairly normal thing to have to do if performance is important).
First, a note on transferring instance data to the GPU. There really shouldn't be any difference in data transfer speeds between a cbuffer and a texture. Both require writing data in blocks and doing DMA transfer, but there really isn't anything interesting there about getting data from the CPU to the GPU. When transferring lots of data to the GPU you might want to consider using a dynamic buffer. A dynamic buffer will give the driver more flexibility in scheduling the data transfer and in this case will also let you transfer a variable amount of data to the card depending on the number of instances you want to draw each frame. A cbuffer will force you to send the same amount of data as your HLSL declared every time, so you'll always be paying the maximum cost even if half of the data ends up being zeros.
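The cost difference described here can be put into rough numbers. The sketch below is only an illustrative model of bytes written per frame, assuming 64-byte world matrices and the poster's 256-entry cbuffer; the function names are hypothetical and it does not measure any real driver behavior.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kMatrixBytes = 64;  // one 4x4 float matrix

// Fixed-size cbuffer: every update writes the full declared array,
// even if the last batch is only partly filled.
std::size_t bytesUploadedFixed(std::size_t instances, std::size_t declaredPerBatch) {
    assert(declaredPerBatch > 0);
    const std::size_t batches = (instances + declaredPerBatch - 1) / declaredPerBatch;
    return batches * declaredPerBatch * kMatrixBytes;
}

// Dynamic buffer: only the matrices actually drawn this frame are written.
std::size_t bytesUploadedDynamic(std::size_t instances) {
    return instances * kMatrixBytes;
}
```

For 400 instances with 256 matrices declared per cbuffer, the fixed path writes 32768 bytes per frame against 25600 bytes for the dynamic path. The gap grows as the last batch gets emptier.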
So why might cbuffers be a problem? Cbuffers have a very different cache structure from a texture. A cbuffer is optimized for in-order access of constants, while a tbuffer is optimized for random (with locality) access. It's possible that every instance is starting off with a cache miss on the cbuffer, since your index into it may be quite different from what the driver/hardware expects for maximum performance (the driver might be preloading data based on expectation), but you'd have to code up a different solution to find out. Tbuffers may provide better cache hits and would allow partial updates, and so could perform better overall. However, I think you should also try using instanced dynamic vertex buffers for your data, since those might have the best behavior (they were designed to optimize this scenario). That said, there are occasional reports of people finding that texture-based model attributes perform better than the input assembler. But those reports are about using textures or tbuffers for data, not cbuffers. (This might actually depend more on whether the mesh data is optimized for vertex caching or not.) Going further into this, the number of attributes in the vertex data also affects utilization of the vertex cache on some hardware.
Since there are so many variables here it can be hard to figure out the right slice to get the best performance so you may just have to try quite a few things. Be careful about expecting any one technique to work well on all cards -- especially between IHVs -- because this isn't often the case.
Well, I expected the cbuffers to work well because the Nvidia instancing demo runs really fast here, but I guess it was rendering less than I am, and it does more than just render plain meshes.
I just stumbled across this piece of info in the "A to Z of DX Performance" presentation of Nvidia, and it says this:
Instance data:
ATI: Ideally should come from additional streams (up to 32 with DX10.1)
NVIDIA: Ideally should come from CB indexing
So I guess 'CB indexing', which is what I am doing, is faster on Nvidia cards than on ATI cards, but I didn't expect THIS much of a performance decrease. I'll add instancing streams to my engine for ATI and see if it works better or not (if I figure out how, anyway).
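One way to act on that recommendation is to pick the instancing path from the adapter's PCI vendor ID. The sketch below is an assumption about how such a switch could look, not something from the presentation: `InstancePath` and `chooseInstancePath` are hypothetical names, and falling back to streams for unknown vendors is a guess based on streams being called the normal way earlier in the thread. In a D3D11 application the ID would typically come from the `VendorId` field of `DXGI_ADAPTER_DESC`.

```cpp
#include <cassert>
#include <cstdint>

// The two ways of feeding instance data discussed in the thread.
enum class InstancePath { VertexStream, CBufferIndexing };

// PCI vendor IDs for the two IHVs.
constexpr std::uint32_t kVendorAti    = 0x1002;
constexpr std::uint32_t kVendorNvidia = 0x10DE;

// Picks a path per vendor, following the quoted recommendation.
// Unknown vendors fall back to vertex streams (an assumption).
InstancePath chooseInstancePath(std::uint32_t vendorId) {
    if (vendorId == kVendorNvidia) {
        return InstancePath::CBufferIndexing;
    }
    return InstancePath::VertexStream;  // ATI and anything else
}
```

Any such switch should still be checked by profiling on the actual hardware, since the recommendation is only a starting point.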