cippyboy

DX11 Single vs Multiple Constant Buffers


So basically, after upgrading from DX9 to DX10, I read a lot of docs from Microsoft about how it's better to organize constant buffers by update frequency, so I made 3 types of constant buffers:

 

PerFrame (view & projection matrices)

PerMaterial (MaterialColor, specular, shininess, etc.)

PerObject (world matrix)

 

I didn't really think about performance considerations though, but one day it struck me: how about I just make 1 buffer that encapsulates all the data? After I did this I noticed that performance actually increased by ~3-5%, even though I was updating an entire, slightly bigger buffer. I thought that maybe drivers at the time (I had an HD5770, a first-gen DX11 device) were not that well optimized for multiple constant buffers, and reverted back to multiple buffers.

 

I now have an HD7850, and after doing this little test again I'm seeing a performance boost of up to +50% for ~100 draw calls when using a single huge constant buffer per object. So in effect the difference is not smaller, it's bigger, signalling that there's something inherently wrong with having too many constant buffers bound. I'm now assuming this may be because my buffers are fairly small. The huge constant buffer is around 460 bytes (I only have 4 matrices, one light and a few other variables), so perhaps multiple buffer switches are more advantageous when you are doing something like fetching an entire vertex buffer (for real-time ambient occlusion based on vertices) or when you work with skinned meshes of 100 bones each.

 

My question is: have you tried rendering a scene with multiple buffers and with a single huge buffer, and compared the performance?


I suspect that with the size of your constant buffers, the cost to bind is actually greater than the cost to pass a small amount of data to the shader. How many objects are you rendering? How many objects per material? The frequency of updates on these constant buffers would have implications for your performance. Even an object with 1 bone will contain 2x as many bytes as your current per-object buffer. Try increasing the size of that per-object buffer and see what happens to your performance data.


There are around 107 objects and for the most part it's just 1-3 objects per material, so I have around 107 materials too. However, with the 3 constant buffers I was only updating the PerFrame buffer once per frame, and then each material had its own PerMaterial buffer and each object its own PerObject buffer that didn't change (I don't animate any objects or material properties currently), so only the view/projection matrices changed.

 

Also, why do you say 1 bone would be 2x my current per-object buffer? 1 bone would be just a float4x4, so that's 64 bytes. I'm planning to do some skinning in the near future and I have around 30 bones, so I'm curious how that will go.


How can I tell if it's CPU- or GPU-bound, since both techniques result in 99% GPU usage (according to Catalyst Control Center)?

3 CBs result in ~400 FPS (or 2.5 ms per frame), and 1 CB results in ~660 FPS (or ~1.51 ms per frame). The CPU is not a bottleneck; even GPU Perf Client says this, the CPU is doing ~0.27 ms of work per frame.

 

I'm creating them with usage DEFAULT and using UpdateSubresource. I was previously using Map/Unmap with WRITE_DISCARD, but there's just no performance difference between the two.

How am I applying state changes? Through PSSetConstantBuffers :). I do admit I call PSSetConstantBuffers 3 times for the 3 constant buffers instead of calling it once with all 3 buffers, but I kind of doubt my ~50% speed penalty is due to state changes.


PSSetConstantBuffers 3 times per object or 3 times per frame?

The only difference in your D3D code between 1 and 3 constant buffers should be how many bytes you send to the constant buffer for each object, apart from once-per-frame setup. If you draw 1000 objects with 1 constant buffer, you do 1 Map and 1 Draw per object. If you draw 1000 objects with 2 constant buffers, you still do 1 Map and 1 Draw per object; you just Map a constant buffer with fewer bytes, since the camera matrices live in a once-per-frame constant buffer that doesn't change between objects and so doesn't need to be touched. You should never set the constant buffers with *SetConstantBuffers more than once per frame. Even if you change shaders, the constant buffers remain set and don't need a new *SetConstantBuffers call.


"each object it's own perobject buffer that didn't change"

 

This is definitely the cause of your observed performance differences. With, say, 100 objects, you're making 100 SetConstantBuffers calls, and they're more expensive than discarding and refilling a single buffer.

 

In the past, I've observed that the best performance comes from:

 

 - One per-frame buffer.

 - One per-material buffer, irrespective of how many materials you have.

 - One per-object buffer, irrespective of how many objects you have.

 

When changing materials you just Map with Discard, then write in the new material properties. When drawing a new object you also Map with Discard and write in the new object properties. This is substantially cheaper than switching buffers each time, and it also lends itself well to piggybacking instancing on top of the same code when the time comes to do that.


This is definitely the cause of your observed performance differences. With, say, 100 objects, you're making 100 SetConstantBuffers calls, and they're more expensive than discarding and refilling a single buffer.

 

 

Interesting; benchmarking my engine, I found calling SetConstantBuffers with pre-filled buffers (i.e. one per scene material) faster than having one buffer mapped with Map for every material change.

From what I understand, a constant buffer lives in video memory, so calling SetConstantBuffers is just updating "a pointer to data" on the GPU, while Map is actually moving data from the CPU.

 

At the end of the day, always benchmark (on different GPUs) before committing to a strategy.

Edited by kunos


"each object it's own perobject buffer that didn't change"

 

This is definitely the cause of your observed performance differences. With, say, 100 objects, you're making 100 SetConstantBuffers calls, and they're more expensive than discarding and refilling a single buffer.

 

In the past, I've observed that the best performance comes from:

 

 - One per-frame buffer.

 - One per-material buffer, irrespective of how many materials you have.

 - One per-object buffer, irrespective of how many objects you have.

 

When changing materials you just Map with Discard, then write in the new material properties.  When drawing a new object you also Map with Discard and write in the new object properties.  This is substantially cheaper than having to switch buffers each time, and also lends itself well to piggybacking instancing on top of the same code when time comes to do that.

 

I tried that once, all materials having their own PerMaterial buffer versus having a single material buffer and constantly updating it, but the performance difference was almost 0.

 

So basically one buffer update (the per-frame buffer) and 300 buffer sets (even though I set the same per-frame buffer, which should be a no-op) is slower than 100 buffer updates and 100 buffer sets.

 

Have you tried making one single buffer versus the 3 you're mapping/discarding on a per-object basis? That was my question all along, whether someone has tried one buffer versus multiple. I'm really curious about your results.


Tried it just now and got zero difference between them, even with abnormally large buffers. Though in certain cases, with one very large buffer and one very small one, plus an already high usage of bandwidth to video memory for other things, there can certainly be a difference...

 

Consider a case where a complex shader indexes into a constant buffer of a couple of thousand vectors that doesn't change between objects, and there are 1000 objects where the only change is the translation matrix; then uploading 64 bytes per object is much better than 32,000 bytes, especially if each frame there are also a couple of transfers of reasonably large dynamic textures going on.


Tried it just now and got zero difference between them, even with abnormally large buffers. Though in certain cases, with one very large buffer and one very small one, plus an already high usage of bandwidth to video memory for other things, there can certainly be a difference...

 

Consider a case where a complex shader indexes into a constant buffer of a couple of thousand vectors that doesn't change between objects, and there are 1000 objects where the only change is the translation matrix; then uploading 64 bytes per object is much better than 32,000 bytes, especially if each frame there are also a couple of transfers of reasonably large dynamic textures going on.

 

I'm not doubting the theoretical advantage, but the real-life results are counter-intuitive, at least for my buffer sizes. I'm assuming that caching plays a much bigger role and buffer slots might not even be implemented properly. Instead of copying all 3 buffers into a contiguous chunk at shader processing time, I think they might just reference them from main GPU memory, and if the slots are allocated at widely separated addresses there might be a ton of cache misses, hence a single buffer being more optimal than 3. But thanks for the results; I guess I'll just have to keep my engine design open to one versus multiple buffers based on usage pattern.


I suspect that with the size of your constant buffers, the cost to bind is actually greater than the cost to pass a small amount of data to the shader. How many objects are you rendering? How many objects per material? The frequency of updates on these constant buffers would have implications for your performance. Even an object with 1 bone will contain 2x as many bytes as your current per-object buffer. Try increasing the size of that per-object buffer and see what happens to your performance data.

 

There are around 107 objects and for the most part it's just 1-3 objects per material, so I have around 107 materials too. However, with the 3 constant buffers I was only updating the PerFrame buffer once per frame, and then each material had its own PerMaterial buffer and each object its own PerObject buffer that didn't change (I don't animate any objects or material properties currently), so only the view/projection matrices changed.

 

Also, why do you say 1 bone would be 2x my current per-object buffer? 1 bone would be just a float4x4, so that's 64 bytes. I'm planning to do some skinning in the near future and I have around 30 bones, so I'm curious how that will go.

 

Your current per-object buffer is just a single matrix; 1 bone would add another matrix, doubling it. A mesh with 30 bones would still have the world matrix, plus another 30 bone matrices.
