DX11 Frame allocator of constant buffers

Hi,

I am writing a linear allocator of per-frame constants using the DirectX 11.1 API. My plan is to replace the traditional constant allocation strategy, where most of the work is done by the driver behind my back, with a manual one inspired by the DirectX 12 and Vulkan APIs.
In brief, the allocator maintains a list of 64K pages, each page owns a constant buffer managed as a ring buffer. Each page has a history of the N previous frames. At the beginning of a new frame, the allocator retires the frames that have been processed by the GPU and frees up the corresponding space in each page. I use DirectX 11 queries for detecting when a frame is complete and the ID3D11DeviceContext1::VS/PSSetConstantBuffers1 methods for binding constant buffers with an offset.
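To make the bookkeeping concrete, here is a minimal sketch of one page in plain C++ (names and layout are my own, not taken from the posted code). Offsets are aligned to 256 bytes because VS/PSSetConstantBuffers1 addresses buffers in units of 16 shader constants (16 bytes each), so bindable offsets must be multiples of 256 bytes:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// One 64K page of the allocator: a ring buffer plus a history of the byte
// counts consumed by each in-flight frame. Retiring the oldest frame (when
// its event query completes) frees its bytes at the tail of the ring.
class ConstantPage {
public:
    static constexpr uint32_t kPageSize  = 64 * 1024;
    static constexpr uint32_t kAlignment = 256;  // 16 constants * 16 bytes

    // Allocate 'size' bytes for the frame being built; returns the offset to
    // bind with VS/PSSetConstantBuffers1, or UINT32_MAX if the page is full.
    uint32_t Alloc(uint32_t size) {
        uint32_t aligned = (size + kAlignment - 1) & ~(kAlignment - 1);
        uint32_t padding = 0;
        if (head_ + aligned > kPageSize)   // a binding cannot wrap, so waste
            padding = kPageSize - head_;   // the tail of the page instead
        if (FreeBytes() < padding + aligned)
            return UINT32_MAX;
        if (padding)
            head_ = 0;
        uint32_t offset = head_;
        head_ = (head_ + aligned) % kPageSize;
        frameUsed_ += padding + aligned;
        return offset;
    }

    // Close the current frame; an event query would be issued alongside this.
    void EndFrame() {
        history_.push_back(frameUsed_);
        frameUsed_ = 0;
    }

    // Called when the oldest in-flight frame's query reports completion.
    void RetireOldestFrame() {
        if (!history_.empty())
            history_.pop_front();
    }

    uint32_t FreeBytes() const {
        uint32_t used = frameUsed_;
        for (uint32_t u : history_)
            used += u;
        return kPageSize - used;
    }

private:
    uint32_t head_ = 0;                // next allocation offset
    uint32_t frameUsed_ = 0;           // bytes consumed by the current frame
    std::deque<uint32_t> history_;     // bytes consumed by in-flight frames
};
```

Note that every allocation is rounded up to 256 bytes, so many small per-draw constants are cheap to place but still waste some space; that is the price of the offset-binding API.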
The new allocator appears to be working, but I am not 100% confident it is actually correct. In particular:
1) It relies on queries, which I am not too familiar with. Are they 100% reliable?
2) It maps/unmaps the constant buffer of each page at the beginning of a new frame and then writes to the mapped memory as the frame is built. In pseudocode:
BeginFrame:
    page.data = device.Map(page.buffer)
    device.Unmap(page.buffer)
RenderFrame:
    Alloc(size, initData)
        ...
        memcpy(page.data + page.start, initData, size)
    Alloc(size, initData)
        ...
        memcpy(page.data + page.start, initData, size)
(Note: I unmap at the beginning of the frame because calling Unmap at the end of the frame instead would mean binding still-mapped constant buffers during the frame, which triggers an error in the debug layer.)
Is this valid?
3) I don't fully understand how many frames I should keep in the history. My intuition says it should equal the maximum latency reported by IDXGIDevice1::GetMaximumFrameLatency, which is 3 on my machine. This value works fine in a unit test, but in a more complex demo I have to manually set it to 5; otherwise the allocator starts overwriting previous frames that have not completed yet. Shouldn't the swap chain's Present method block the CPU in this case?
4) Should I expect this approach to be more efficient than the one managed by the driver? I don't have meaningful profiling data yet.

Is anybody familiar with the approach described above? I would appreciate answers to my questions and a discussion of the pros and cons of this technique based on your experience.
For reference, I've uploaded the (WIP) allocator code at https://paste.ofcode.org/Bq98ujP6zaAuKyjv4X7HSv. Feel free to adapt it to your engine, and please let me know if you spot any mistakes :)

Thanks

Stefano Lanza
 


Sorry I haven't had time to actually read your code, so quick answers:

1 hour ago, Reitano said:

1) It relies on queries, which I am not too familiar with. Are they 100% reliable?

If you're using them correctly, yes. Event queries are perfect for telling whether the GPU has completed a batch of commands yet or not.

1 hour ago, Reitano said:

    page.data = device.Map(page.buffer)
    device.Unmap(page.buffer)
...
   memcpy(page.data

That's undefined behaviour. The 'page.data' pointer is only valid in-between the call to Map and the call to Unmap. Writing to it after the call to Unmap is not allowed.

You have to map, write in a lot of constants, unmap, then bind and draw.

Yes, this sucks. In GL/D3D12/Vulkan you can do a "persistent map", where there is no Unmap call at all. In D3D11 I don't think persistent mapping is possible, so you've got to jump through hoops to keep the binding API happy. 
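To illustrate the rule, here is a tiny stand-in class (plain C++, no D3D, purely illustrative) that enforces the window in which writes are legal — the same window that the original pseudocode steps outside of:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Minimal mock of a dynamic buffer: Write() outside a Map/Unmap pair is
// rejected, mirroring the rule that the pointer returned by
// ID3D11DeviceContext::Map is only valid until the matching Unmap.
class DynamicBuffer {
public:
    explicit DynamicBuffer(size_t size) : storage_(size) {}

    void* Map()   { mapped_ = true;  return storage_.data(); }
    void  Unmap() { mapped_ = false; }

    // Returns false when the caller writes while unmapped, i.e. the
    // undefined-behaviour pattern from the original pseudocode.
    bool Write(size_t offset, const void* src, size_t size) {
        if (!mapped_)
            return false;
        std::memcpy(storage_.data() + offset, src, size);
        return true;
    }

private:
    std::vector<unsigned char> storage_;
    bool mapped_ = false;
};
```

In real D3D11 the out-of-window write is not politely rejected; it simply reads or corrupts memory the driver may have reused, which is why it can appear to work on one machine and fail on another.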

I'm not sure if you should restructure things so that you can upload all of your constants long before you start executing any draws, or if you should simply perform a lot more map/unmap calls. I haven't implemented this new cbuffer updating method in D3D11.1 yet because the traditional D3D11 methods have been performing fine for me so far :|

1 hour ago, Reitano said:

I don't fully understand how many frames I should keep in the history.

If you're using queries to track the GPU's progress, then the same number of frames that you're tracking with queries... but I guess that's a circular answer ;)

 To keep a GPU busy you typically need one frame's worth of completed commands queued up while you're working on the next frame's commands, so at least 2. If you want to be even more sure about keeping things smooth, go for 3. Any more than that and you're just adding excessive latency to your game IMHO.

1 hour ago, Reitano said:

the allocator starts overwriting previous frames that have not completed yet

You should use the queries to block the CPU yourself if the CPU is attempting to map a buffer that you know is still potentially in use by the GPU.

Instead of these queries being owned by the buffer system, I like to have the core rendering device own the event queries and use them to give a guarantee about how far behind the CPU the GPU can possibly be. e.g. if your core rendering device promises "the GPU will never be more than two frames behind the CPU", then other systems don't need their own queries -- they can simply make assumptions like "I'm currently preparing frame #8, which means the GPU is either working on frame #8, #7 or #6, so I can safely overwrite data from frame #5 without checking any queries".
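As a sketch, that guarantee reduces to a little arithmetic (hypothetical helper names):

```cpp
#include <cassert>
#include <cstdint>

// The core device promises: "the GPU is never more than kMaxLag frames
// behind the CPU". Any subsystem can then decide, without owning queries,
// which frame's data is safe to overwrite while preparing 'cpuFrame'.
constexpr uint64_t kMaxLag = 2;

// Oldest frame the GPU could still be reading from.
uint64_t OldestFrameInFlight(uint64_t cpuFrame) {
    return cpuFrame >= kMaxLag ? cpuFrame - kMaxLag : 0;
}

// Data from 'dataFrame' may be overwritten iff the GPU can no longer touch it.
bool SafeToOverwrite(uint64_t cpuFrame, uint64_t dataFrame) {
    return dataFrame < OldestFrameInFlight(cpuFrame);
}
```

With kMaxLag = 2 and the CPU preparing frame #8, the GPU is on frame #8, #7 or #6, so frame #5's data is safe to reuse and frame #6's is not — exactly the example above.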

[edit] Another lazy approach is to use write-discard instead of write-no-overwrite. This asks the driver to manage garbage collection of old data for you, and you don't have to think about where the GPU is up to... You could implement that for comparison and see how it differs in speed.
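For intuition, the buffer "renaming" that write-discard asks the driver to perform can be modelled like this (a toy model, not the driver's actual implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy model of MAP_WRITE_DISCARD: each Map hands back a fresh backing
// allocation, so the GPU can keep reading the previous contents while the
// CPU fills the new one. The driver garbage-collects retired copies once
// the GPU is done with them; here we just keep them in a list.
class DiscardBuffer {
public:
    explicit DiscardBuffer(size_t size) : size_(size) {}

    unsigned char* MapDiscard() {
        if (current_)                                  // retire the old copy;
            inFlight_.push_back(std::move(current_));  // the GPU may still read it
        current_ = std::make_unique<unsigned char[]>(size_);
        return current_.get();
    }

    size_t InFlightCopies() const { return inFlight_.size(); }

private:
    size_t size_;
    std::unique_ptr<unsigned char[]> current_;
    std::vector<std::unique_ptr<unsigned char[]>> inFlight_;
};
```

The cost of this convenience is that the driver has to manage the renamed copies and track GPU progress itself, which is precisely the work the manual ring-buffer scheme is trying to take over.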

3 hours ago, Reitano said:

3) I don't fully understand how many frames I should keep in the history. My intuition says it should equal the maximum latency reported by IDXGIDevice1::GetMaximumFrameLatency, which is 3 on my machine. This value works fine in a unit test, but in a more complex demo I have to manually set it to 5; otherwise the allocator starts overwriting previous frames that have not completed yet. Shouldn't the swap chain's Present method block the CPU in this case?
4) Should I expect this approach to be more efficient than the one managed by the driver? I don't have meaningful profiling data yet.

For 3, like Hodgman said, if your event queries are working correctly, you shouldn't need to worry about this value. However with that said, the maximum frame latency is not 100% accurate, due to several factors. The drivers are able to override this frame latency, both explicitly as an override if an app never set anything, and implicitly by deferring the actual present operation until after the Present() API has returned. However, on new drivers and new OSes (Windows 10 Anniversary Update with WDDM2.1 drivers at least) using a FLIP_SEQUENTIAL or FLIP_DISCARD swap effect, the maximum frame latency should actually be accurate.

For 4... maybe. At best, you're getting simpler allocation strategies from the drivers because you're allocating large buffers instead of small ones, and are (maybe) running less code to do it. At worst, you're actually doing pretty much the exact same approach the driver would if you were using MAP_WRITE_DISCARD.


Thank you guys for your replies. Mapping/unmapping and then writing to the mapped memory does indeed smell of undefined behaviour. So far it works on my machine, but I should definitely test it on other GPUs to be more confident. I like this approach because the client code is quite concise, not requiring two calls to Map and Unmap for every constant upload operation. A pity DX11 does not have the concept of persistent mappings.

As for the latency, I am now ignoring the value returned by IDXGIDevice1::GetMaximumFrameLatency and instead using a conservative latency of 5 for the allocator. I will also add a loop to block the CPU in case the number of queued frames goes above this value (which it really shouldn't).

@SoldierOfLight

I will read about the new presentation modes. Thanks!


8 minutes ago, Reitano said:

So far it works on my machine but I should definitely test it on other GPUs to be more confident.

Even if it happens to work on 100 machines that you test on, it may still crash or cause memory corruption on the 101st one... or it may begin to crash/corrupt after the next driver update... 

It's simply luck that it works at all -- apparently your driver is keeping this address range persistently mapped by chance. Even with persistent mapping though, there's typically some kind of "synchronize" API call that you still use in place of "unmap", which ensures that the CPU's write combining buffer has been flushed (i.e. ensure that values that you've written have actually reached RAM before continuing) and instructs the GPU to invalidate this address range from its caches if it happens to be present. Assuming that this trick is actually giving you a persistently mapped buffer in D3D11, without these "synchronize" tasks being performed, it's still unsafe and the GPU may consume out of date data :(


As an alternative to Map/Unmap, you can also use UpdateSubresource1 as described in this article. That method also spares you from manually avoiding writes to a buffer that the GPU is currently reading from, which is pretty dodgy to begin with in D3D11 since you don't have explicit submission or fences.


Thank you all, you've been very helpful.

@Hodgman

You are so right; I shouldn't even consider code with undefined behavior. I fixed the allocator so that every memory write is bracketed by Map and Unmap calls. On the API side, client code can use a convenient Upload method for small structures like camera data, and manual Map/Unmap methods for potentially large chunks of data, like model instances, lights, materials etc.

You can find the new code at https://codeshare.io/2p7ZbV

I am planning to refactor the rendering engine at a high level. The idea is to upload ALL constants in a first stage, and only at the end bind them and issue draw calls. This should allow a single Map/Unmap call per constant buffer, as I had originally.
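A rough sketch of that two-phase idea, with hypothetical names (the staging and flush would sit in front of the real Map/Unmap on the page's buffer):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Phase 1 records uploads into CPU memory and hands out the offsets that
// will later be bound with VS/PSSetConstantBuffers1. Phase 2 copies them all
// with a single Map/Unmap per page buffer, before any draw call is issued.
struct Upload {
    uint32_t offset;
    std::vector<unsigned char> bytes;
};

class FrameUploader {
public:
    // Phase 1: stage a constant block; offsets advance in 256-byte steps as
    // the offset-binding API requires.
    uint32_t Stage(const void* data, uint32_t size) {
        const unsigned char* p = static_cast<const unsigned char*>(data);
        uint32_t offset = cursor_;
        cursor_ = (cursor_ + size + 255) & ~255u;
        pending_.push_back({offset, std::vector<unsigned char>(p, p + size)});
        return offset;
    }

    // Phase 2: one Map/Unmap for the whole frame; 'mapped' stands in for the
    // pointer returned by ID3D11DeviceContext::Map on the page's buffer.
    void Flush(unsigned char* mapped) {
        for (const Upload& u : pending_)
            std::memcpy(mapped + u.offset, u.bytes.data(), u.bytes.size());
        pending_.clear();
    }

    uint32_t BytesUsed() const { return cursor_; }

private:
    uint32_t cursor_ = 0;
    std::vector<Upload> pending_;
};
```

The trade-off is an extra CPU-side copy per constant block in exchange for touching the driver only once per buffer per frame.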

