
DX11 Single vs Multiple Constant Buffers


So basically, after upgrading from DX9 to DX10 I read a lot of docs from Microsoft about how it's better to organize constant buffers by update frequency, so I made 3 types of constant buffers (sketched in code below):

 

PerFrame (view & projection matrices)

PerMaterial (MaterialColor, specular, shininess, etc.)

PerObject (world matrix)
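
For reference, the C++-side layouts look roughly like this (field names are from memory, so treat them as illustrative rather than my exact code):

```cpp
#include <DirectXMath.h>
using namespace DirectX;

// Updated once per frame.
struct PerFrameCB
{
    XMFLOAT4X4 view;
    XMFLOAT4X4 projection;
};

// Updated whenever the material changes.
struct PerMaterialCB
{
    XMFLOAT4 materialColor;
    XMFLOAT4 specular;
    float    shininess;
    float    padding[3];   // D3D11 constant buffers must be a multiple of 16 bytes
};

// Updated per object.
struct PerObjectCB
{
    XMFLOAT4X4 world;
};
```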

 

I hadn't really thought about the performance implications though, but one day it struck me: what if I just made 1 buffer to encapsulate all the data? After I did this I noticed that performance actually increased by ~3-5%, even though I was now updating the entire, slightly bigger buffer every time. I thought that maybe the drivers at the time (I had an HD 5770, a first-gen DX11 device) were just not that well optimized for multiple constant buffers, and I reverted back to multiple buffers.

 

I now have an HD 7850, and after doing this little test again I'm seeing a performance boost of up to +50% for ~100 draw calls when using a single huge constant buffer per object. So in effect the difference is not smaller, it's bigger, signalling that there's something inherently wrong with having too many constant buffers bound. I'm now assuming this may be because my buffers are fairly small. The single huge constant buffer is around 460 bytes (I only have 4 matrices, one light and a few other variables), so perhaps multiple buffer switches become advantageous when you're doing something like fetching an entire vertex buffer (for real-time ambient occlusion based on vertices) or working with skinned meshes of 100 bones each.
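
The combined version is essentially all of the above folded into one struct, something like this (a rough reconstruction, not the literal layout):

```cpp
// Everything in one buffer, padded to a multiple of 16 bytes (~460 bytes total).
struct SingleCB
{
    XMFLOAT4X4 world;          // 64 bytes
    XMFLOAT4X4 view;           // 64
    XMFLOAT4X4 projection;     // 64
    XMFLOAT4X4 worldViewProj;  // 64  <- the "4 matrices"
    XMFLOAT4   lightPosition;  // 16
    XMFLOAT4   lightColor;     // 16  <- the "one light"
    XMFLOAT4   materialColor;  // 16
    XMFLOAT4   specular;       // 16
    // ...plus a few other variables, up to ~460 bytes
};
```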

 

My question is: have you tried rendering a scene with multiple constant buffers and with a single huge buffer and compared the performance?


I suspect that, with the size of your constant buffers, the cost to bind is actually greater than the cost of passing a small amount of data to the shader. How many objects are you rendering? How many objects per material? The frequency of updates on these constant buffers will have implications for your performance. Even an object with 1 bone will contain 2x as many bytes as your current per-object buffer. Try increasing the size of that per-object buffer and see what happens to your performance data.


There are around 107 objects, and for the most part it's just 1-3 objects per material, so I have around 107 materials too. However, with the 3 constant buffers I was only updating the per-frame buffer once per frame, and then each material had its own per-material buffer and each object its own per-object buffer that didn't change (I don't animate any objects or material properties currently), so only the view/projection matrices changed.

 

Also, why do you say 1 bone would be 2x my current per-object buffer? 1 bone would be just a float4x4, so that's like 64 bytes. I'm planning to do some skinning in the near future and I have around 30 bones, so I'm curious how that will go.


Do you know if the performance difference is on the CPU-side, GPU-side, or both?

How many ms per frame is your game using in both scenarios?

How are you creating/updating the buffers?

How are you applying these state changes?


How can I tell whether it's the CPU or the GPU, since both techniques result in 99% GPU usage (according to Catalyst Control Center)?

3 CBs result in ~400 FPS (or 2.5 ms per frame), and 1 CB results in ~660 FPS (or ~1.51 ms per frame). The CPU is not a bottleneck, even GPU Perf Client says this; the CPU is doing ~0.27 ms of work per frame.

 

Creating them with usage DEFAULT and updating via UpdateSubresource. I was previously using Map/Unmap with WRITE_DISCARD, but there's just no performance difference between the two.

How am I applying state changes? Through PSSetConstantBuffers :). I do admit I call PSSetConstantBuffers 3 times for the 3 constant buffers instead of calling it once with all 3 buffers, but I kind of doubt my ~50% speed penalty is due to state changes.
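
For clarity, this is the difference I mean (buffer variable names are just placeholders):

```cpp
// What I do now: three separate binding calls...
context->PSSetConstantBuffers(0, 1, &perFrameCB);
context->PSSetConstantBuffers(1, 1, &perMaterialCB);
context->PSSetConstantBuffers(2, 1, &perObjectCB);

// ...versus binding all three slots in a single call:
ID3D11Buffer* buffers[3] = { perFrameCB, perMaterialCB, perObjectCB };
context->PSSetConstantBuffers(0, 3, buffers);
```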


PSSetConstantBuffers 3 times per object or 3 times per frame?

The only difference in your D3D code between 1 and 3 constant buffers should be how many bytes you send to the constant buffer for each object, except for once-per-frame setup. If you draw 1000 objects with 1 constant buffer you do 1 Map and 1 Draw per object. If you draw 1000 objects with 2 constant buffers, you still do 1 Map and 1 Draw per object, it's just that you Map a constant buffer with fewer bytes since the camera matrices are in a once-per-frame setup constant buffer that doesn't change between objects, and so doesn't need to be touched. You should never set the constant buffers with *SetConstantBuffers more than once per frame. Even if you change shaders the constant buffers remain set and do not need a new call to *SetConstantBuffers.
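
In code, the per-object path being described would look something like this sketch, assuming a single dynamic per-object buffer (names are placeholders):

```cpp
// Once per frame: bind every constant buffer a single time.
ID3D11Buffer* cbs[2] = { perFrameCB, perObjectCB };
context->VSSetConstantBuffers(0, 2, cbs);

for (const Object& obj : objects)
{
    // 1 Map per object: discard the old contents and write only the world matrix.
    D3D11_MAPPED_SUBRESOURCE mapped;
    context->Map(perObjectCB, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
    memcpy(mapped.pData, &obj.world, sizeof(obj.world));   // 64 bytes
    context->Unmap(perObjectCB, 0);

    // 1 Draw per object; no further *SetConstantBuffers calls needed.
    context->DrawIndexed(obj.indexCount, obj.startIndex, 0);
}
```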


"each object it's own perobject buffer that didn't change"

 

This is definitely the cause of your observed performance differences. With, say, 100 objects, you're making 100 SetConstantBuffers calls, and they're more expensive than discarding and refilling a single buffer.

 

In the past, I've observed that the best performance comes from:

 

 - One per-frame buffer.

 - One per-material buffer, irrespective of how many materials you have.

 - One per-object buffer, irrespective of how many objects you have.

 

When changing materials you just Map with Discard, then write in the new material properties. When drawing a new object you also Map with Discard and write in the new object properties. This is substantially cheaper than having to switch buffers each time, and it also lends itself well to piggybacking instancing on top of the same code when the time comes to do that.
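
A minimal sketch of that setup, assuming three shared dynamic buffers created once at startup (the helper and all names are illustrative):

```cpp
// One CPU-writable constant buffer per update frequency, reused all frame.
ID3D11Buffer* CreateDynamicCB(ID3D11Device* device, UINT byteWidth)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth      = (byteWidth + 15) & ~15u;  // round up to 16 bytes
    desc.Usage          = D3D11_USAGE_DYNAMIC;
    desc.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;

    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, nullptr, &buffer);
    return buffer;
}

// On every material change: discard and refill the same shared buffer.
D3D11_MAPPED_SUBRESOURCE mapped;
context->Map(perMaterialCB, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
memcpy(mapped.pData, &material.constants, sizeof(PerMaterialCB));
context->Unmap(perMaterialCB, 0);
// Per-object updates follow the exact same Map/Discard pattern.
```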


This is definitely the cause of your observed performance differences. With, say, 100 objects, you're making 100 SetConstantBuffers calls, and they're more expensive than discarding and refilling a single buffer.

 

 

Interesting; benchmarking my engine, I found calling SetConstantBuffers with pre-filled buffers (i.e. one per scene material) faster than having one buffer that I Map on every material change.

From what I understand, a constant buffer lives in video memory, so calling SetConstantBuffers is just updating "a pointer to data" on the GPU, while Map is actually moving data from the CPU.

 

At the end of the day, always benchmark (on different GPUs) before committing to a strategy.
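
For what it's worth, the pre-filled approach looks roughly like this (a sketch; struct and variable names are made up):

```cpp
// At load time: one immutable constant buffer per material, filled once,
// never touched again by the CPU.
ID3D11Buffer* CreateMaterialCB(ID3D11Device* device, const PerMaterialCB& data)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = sizeof(PerMaterialCB);   // must be a multiple of 16
    desc.Usage     = D3D11_USAGE_IMMUTABLE;
    desc.BindFlags = D3D11_BIND_CONSTANT_BUFFER;

    D3D11_SUBRESOURCE_DATA init = {};
    init.pSysMem = &data;

    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, &init, &buffer);
    return buffer;
}

// At draw time: no Map at all, just rebind the right pre-filled buffer.
context->PSSetConstantBuffers(1, 1, &material.constantBuffer);
```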


"each object it's own perobject buffer that didn't change"

 

This is definitely the cause of your observed performance differences.  With - say - 100 objects, you're making 100 SetConstantBiuffers calls, and they're more expensive than discarding and refilling a single buffer.

 

In the past, I've observed that the best performance comes from:

 

 - One per-frame buffer.

 - One per-material buffer, irrespective of how many materials you have.

 - One per-object buffer, irrespective of how many objects you have.

 

When changing materials you just Map with Discard, then write in the new material properties. When drawing a new object you also Map with Discard and write in the new object properties. This is substantially cheaper than having to switch buffers each time, and it also lends itself well to piggybacking instancing on top of the same code when the time comes to do that.

 

I tried that once: all materials having their own PerMaterial buffer versus a single material buffer that I constantly updated, and the performance difference was almost 0.

 

So basically, one buffer update (the per-frame buffer) and 300 buffer sets (even though I set the same per-frame buffer, which should be a no-op) is slower than 100 buffer updates and 100 buffer sets.

 

Have you tried making one single buffer versus the 3 you're mapping/discarding on a per-object basis? That was my question all along: if someone has tried one buffer versus multiple, I'm really curious about your results.


Tried it just now and got zero difference between them, even with abnormally large buffers. Though in certain cases, with one very large buffer and one very small one, plus already high bandwidth usage to video memory for other things, there can certainly be a difference...

 

Consider a case where a complex shader indexes into a constant buffer of a couple of thousand vectors that doesn't change between objects, and there are 1000 objects where the only change is the translation matrix. Then uploading 64 bytes per object is much better than 32,000 bytes, especially if each frame there are also a couple of transfers of reasonably large dynamic textures going on.
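
That scenario would look something like this sketch (slot numbers and names are illustrative):

```cpp
// The big table (~2000 float4s = 32,000 bytes) is IMMUTABLE and bound once.
context->VSSetConstantBuffers(0, 1, &largeStaticCB);
context->VSSetConstantBuffers(1, 1, &smallPerObjectCB);  // 64-byte dynamic CB

for (const Object& obj : objects)
{
    // Only 64 bytes cross the bus per object, not 32,000.
    D3D11_MAPPED_SUBRESOURCE mapped;
    context->Map(smallPerObjectCB, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
    memcpy(mapped.pData, &obj.translation, sizeof(obj.translation)); // a float4x4
    context->Unmap(smallPerObjectCB, 0);

    context->DrawIndexed(obj.indexCount, obj.startIndex, 0);
}
```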


Tried it just now and got zero difference between them, even with abnormally large buffers. Though in certain cases, with one very large buffer and one very small one, plus already high bandwidth usage to video memory for other things, there can certainly be a difference...

 

Consider a case where a complex shader indexes into a constant buffer of a couple of thousand vectors that doesn't change between objects, and there are 1000 objects where the only change is the translation matrix. Then uploading 64 bytes per object is much better than 32,000 bytes, especially if each frame there are also a couple of transfers of reasonably large dynamic textures going on.

 

I'm not doubting the theoretical advantage, but the real-life results are counter-intuitive, at least for my buffer sizes. I'm assuming caching plays a much bigger role and that buffer slots might not even be implemented properly. Instead of copying all 3 buffers into a contiguous chunk at shader processing time, I think they might just reference them from main GPU memory, and if the slots are allocated at large address differences there could be a ton of cache misses there, hence a single buffer being more optimal than 3. But thanks for the results; I guess I'll just have to keep my engine design open to one versus multiple buffers based on usage pattern.


I suspect that, with the size of your constant buffers, the cost to bind is actually greater than the cost of passing a small amount of data to the shader. How many objects are you rendering? How many objects per material? The frequency of updates on these constant buffers will have implications for your performance. Even an object with 1 bone will contain 2x as many bytes as your current per-object buffer. Try increasing the size of that per-object buffer and see what happens to your performance data.

 

There are around 107 objects, and for the most part it's just 1-3 objects per material, so I have around 107 materials too. However, with the 3 constant buffers I was only updating the per-frame buffer once per frame, and then each material had its own per-material buffer and each object its own per-object buffer that didn't change (I don't animate any objects or material properties currently), so only the view/projection matrices changed.

 

Also, why do you say 1 bone would be 2x my current per-object buffer? 1 bone would be just a float4x4, so that's like 64 bytes. I'm planning to do some skinning in the near future and I have around 30 bones, so I'm curious how that will go.

 

Your current per-object buffer is just a single matrix. 1 bone would add another matrix, so a mesh with 30 bones would still have the world matrix, plus another 30 bone matrices.
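
In struct form (a sketch; MAX_BONES and the field names are illustrative):

```cpp
// Today: just the world matrix = 64 bytes.
struct PerObjectCB
{
    XMFLOAT4X4 world;              // 64 bytes
};

// With skinning: world matrix + one matrix per bone.
// 1 bone   -> 64 + 64    = 128 bytes (2x the current buffer)
// 30 bones -> 64 + 30*64 = 1,984 bytes
static const int MAX_BONES = 30;
struct SkinnedPerObjectCB
{
    XMFLOAT4X4 world;              // 64 bytes
    XMFLOAT4X4 bones[MAX_BONES];   // 30 * 64 = 1,920 bytes
};
```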
