Reitano

DX11 Frame allocator of constant buffers

Recommended Posts

Hi,

I am writing a linear allocator of per-frame constants using the DirectX 11.1 API. My plan is to replace the traditional constant allocation strategy, where most of the work is done by the driver behind my back, with a manual one inspired by the DirectX 12 and Vulkan APIs.
In brief, the allocator maintains a list of 64 KB pages; each page owns a constant buffer managed as a ring buffer and keeps a history of the N previous frames. At the beginning of a new frame, the allocator retires the frames that have been processed by the GPU and frees up the corresponding space in each page. I use DirectX 11 queries to detect when a frame is complete, and the ID3D11DeviceContext1::VS/PSSetConstantBuffers1 methods to bind constant buffers with an offset.
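For reference, a minimal sketch of the offset binding this relies on might look like the following; the helper name and the assumption that ring-buffer offsets are 256-byte aligned are mine, not part of the actual allocator:

#include <d3d11_1.h>

// Sketch: bind a sub-range of a large page buffer to VS slot 0.
// VSSetConstantBuffers1 expresses offset and size in 16-byte constants,
// and both values must be multiples of 16 constants (256 bytes).
void BindVSRange(ID3D11DeviceContext1* ctx1, ID3D11Buffer* pageBuffer,
                 UINT byteOffset, UINT byteSize)
{
    const UINT firstConstant = byteOffset / 16;               // byteOffset assumed 256-byte aligned
    const UINT numConstants  = ((byteSize + 255) / 256) * 16; // size rounded up to 256 bytes
    ctx1->VSSetConstantBuffers1(0, 1, &pageBuffer, &firstConstant, &numConstants);
}
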
The new allocator appears to be working but I am not 100% confident it is actually correct. In particular:
1) It relies on queries, which I am not too familiar with. Are they 100% reliable?
2) It maps and immediately unmaps the constant buffer of each page at the beginning of a new frame, and then writes to the mapped memory as the frame is built. In pseudocode:
BeginFrame:
    page.data = device.Map(page.buffer)
    device.Unmap(page.buffer)
RenderFrame:
    Alloc(size, initData)
        ...
        memcpy(page.data + page.start, initData, size)
    Alloc(size, initData)
        ...
        memcpy(page.data + page.start, initData, size)
(Note: deferring the Unmap call to the end of the frame is not an option, because binding a still-mapped constant buffer triggers an error in the debug layer.)
Is this valid?
3) I don't fully understand how many frames I should keep in the history. My intuition says it should equal the maximum latency reported by IDXGIDevice1::GetMaximumFrameLatency, which is 3 on my machine. But while this value works fine in a unit test, in a more complex demo I need to manually set it to 5, otherwise the allocator starts overwriting previous frames that have not completed yet. Shouldn't the swap chain's Present method block the CPU in this case?
4) Should I expect this approach to be more efficient than the one managed by the driver? I don't have meaningful profiling data yet.

Is anybody familiar with the approach described above? I would appreciate answers to my questions and a discussion of the pros and cons of this technique, based on your experience.
For reference, I've uploaded the (WIP) allocator code at https://paste.ofcode.org/Bq98ujP6zaAuKyjv4X7HSv. Feel free to adapt it for your engine, and please let me know if you spot any mistakes :)

Thanks

Stefano Lanza
 


Sorry, I haven't had time to actually read your code, so here are some quick answers:

1 hour ago, Reitano said:

1) It relies on queries, which I am not too familiar with. Are they 100% reliable?

If you're using them correctly, yes. Event queries are perfect for telling whether the GPU has completed a batch of commands yet or not.
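For illustration, a minimal event-query sketch (the helper names are mine) could look like this:

// Create an event query once per tracked frame.
ID3D11Query* CreateFrameQuery(ID3D11Device* device)
{
    D3D11_QUERY_DESC desc = { D3D11_QUERY_EVENT, 0 };
    ID3D11Query* query = nullptr;
    device->CreateQuery(&desc, &query);
    return query;
}

// Issue it after submitting a frame's commands with ctx->End(query).
// Later, GetData returns S_OK once the GPU has executed everything up to that point.
bool FrameComplete(ID3D11DeviceContext* ctx, ID3D11Query* query)
{
    BOOL signalled = FALSE;
    return ctx->GetData(query, &signalled, sizeof(signalled), 0) == S_OK;
}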

1 hour ago, Reitano said:

    page.data = device.Map(page.buffer)
    device.Unmap(page.buffer)
...
   memcpy(page.data

That's undefined behaviour. The 'page.data' pointer is only valid in-between the call to Map and the call to Unmap. Writing to it after the call to Unmap is not allowed.

You have to map, write in a lot of constants, unmap, then bind and draw.

Yes, this sucks. In GL/D3D12/Vulkan you can do a "persistent map", where there is no Unmap call at all. In D3D11 I don't think persistent mapping is possible, so you've got to jump through hoops to keep the binding API happy. 
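In other words, something like this per batch of constants (a sketch only; it assumes the runtime/driver allows WRITE_NO_OVERWRITE maps on constant buffers, which is itself a D3D11.1 feature):

// Sketch: copy 'size' bytes into 'buffer' at 'offset', with Map/Unmap bracketing the write.
void WriteConstants(ID3D11DeviceContext1* ctx, ID3D11Buffer* buffer,
                    UINT offset, const void* data, UINT size)
{
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(ctx->Map(buffer, 0, D3D11_MAP_WRITE_NO_OVERWRITE, 0, &mapped)))
    {
        memcpy(static_cast<uint8_t*>(mapped.pData) + offset, data, size);
        ctx->Unmap(buffer, 0); // the mapped pointer is invalid after this
    }
    // Only after Unmap: bind with VS/PSSetConstantBuffers1 and draw.
}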

I'm not sure if you should restructure things so that you can upload all of your constants long before you start executing any draws, or if you should simply perform a lot more map/unmap calls. I haven't implemented this new cbuffer updating method in D3D11.1 yet because the traditional D3D11 methods have been performing fine for me so far :|

1 hour ago, Reitano said:

I don't fully understand how many frames I should keep in the history.

If you're using queries to track the GPU's progress, then the same number of frames that you're tracking with queries... but I guess that's a circular answer ;)

 To keep a GPU busy you typically need one frame's worth of completed commands queued up while you're working on the next frame's commands, so at least 2. If you want to be even more sure about keeping things smooth, go for 3. Any more than that and you're just adding excessive latency to your game IMHO.

1 hour ago, Reitano said:

the allocator starts overwriting previous frames that have not completed yet

You should use the queries to block the CPU yourself if the CPU is attempting to map a buffer that you know is still potentially in use by the GPU.

Instead of these queries being owned by the buffer system, I like to have the core rendering device own the event queries and use them to give a guarantee about how far behind the CPU the GPU can possibly be. e.g. if your core rendering device promises "the GPU will never be more than two frames behind the CPU", then other systems don't need their own queries -- they can simply make assumptions like "I'm currently preparing frame #8, which means the GPU is either working on frame #8, #7 or #6, so I can safely overwrite data from frame #5 without checking any queries".
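A sketch of that device-level guarantee, with illustrative names (the query ring and kMaxFramesInFlight are my own, not from your code):

// One event query per potentially in-flight frame. Before starting frame N,
// wait on the query issued at the end of frame N - kMaxFramesInFlight, so the
// GPU can never be more than kMaxFramesInFlight frames behind the CPU.
static const unsigned kMaxFramesInFlight = 2;

struct FrameFence
{
    ID3D11Query* queries[kMaxFramesInFlight]; // created elsewhere with D3D11_QUERY_EVENT

    void EndFrame(ID3D11DeviceContext* ctx, unsigned frameIndex)
    {
        ctx->End(queries[frameIndex % kMaxFramesInFlight]); // mark the end of this frame's commands
    }

    void BeginFrame(ID3D11DeviceContext* ctx, unsigned frameIndex)
    {
        if (frameIndex < kMaxFramesInFlight)
            return; // nothing has been issued for this slot yet
        ID3D11Query* q = queries[frameIndex % kMaxFramesInFlight];
        BOOL done = FALSE;
        while (ctx->GetData(q, &done, sizeof(done), 0) != S_OK)
        {
            // Spin (or Sleep/yield) until the GPU has finished the frame that used this slot.
        }
    }
};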

[edit] Another lazy approach is to use write-discard instead of write-no-overwrite. This asks the driver to manage garbage collection of old data for you, and you don't have to think about where the GPU is up to... You could implement that for comparison and see how it differs in speed.
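For comparison, the write-discard path is just this (cb and constants being placeholders):

D3D11_MAPPED_SUBRESOURCE mapped = {};
if (SUCCEEDED(ctx->Map(cb, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
{
    memcpy(mapped.pData, &constants, sizeof(constants)); // driver hands back a fresh region
    ctx->Unmap(cb, 0);                                   // and retires the old one itself
}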

3 hours ago, Reitano said:

3) I don't fully understand how many frames I should keep in the history. My intuition says it should equal the maximum latency reported by IDXGIDevice1::GetMaximumFrameLatency, which is 3 on my machine. But while this value works fine in a unit test, in a more complex demo I need to manually set it to 5, otherwise the allocator starts overwriting previous frames that have not completed yet. Shouldn't the swap chain's Present method block the CPU in this case?
4) Should I expect this approach to be more efficient than the one managed by the driver? I don't have meaningful profiling data yet.

For 3, like Hodgman said, if your event queries are working correctly you shouldn't need to worry about this value. That said, the maximum frame latency is not 100% accurate, due to several factors: drivers are able to override the frame latency, both explicitly (as an override if the app never set anything) and implicitly (by deferring the actual present operation until after the Present() API has returned). However, on new drivers and new OSes (Windows 10 Anniversary Update with WDDM 2.1 drivers, at least) using a FLIP_SEQUENTIAL or FLIP_DISCARD swap effect, the maximum frame latency should actually be accurate.
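For what it's worth, a sketch of opting into a flip-model swap effect and pinning the latency explicitly (the specific values are illustrative):

DXGI_SWAP_CHAIN_DESC1 desc = {};
desc.Width            = 1280;
desc.Height           = 720;
desc.Format           = DXGI_FORMAT_R8G8B8A8_UNORM;
desc.SampleDesc.Count = 1;
desc.BufferUsage      = DXGI_USAGE_RENDER_TARGET_OUTPUT;
desc.BufferCount      = 3;
desc.SwapEffect       = DXGI_SWAP_EFFECT_FLIP_DISCARD; // flip model (Windows 10); FLIP_SEQUENTIAL on Windows 8+
// factory->CreateSwapChainForHwnd(device, hwnd, &desc, nullptr, nullptr, &swapChain);
// dxgiDevice1->SetMaximumFrameLatency(2);              // IDXGIDevice1: pin the latency explicitly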

For 4... maybe. At best, you're getting simpler allocation strategies from the driver because you're allocating large buffers instead of small ones, and are (maybe) running less code to do it. At worst, you're doing pretty much the same thing the driver would do if you were using MAP_WRITE_DISCARD.


Thank you guys for your replies. Mapping/unmapping and then writing to mapped memory indeed smells of undefined behaviour. So far it works on my machine, but I should definitely test it on other GPUs to be more confident. I like this approach as the client code is quite concise, not requiring two calls to Map and Unmap for every constant upload operation. A pity DX11 does not have the concept of persistent mappings.

As for the latency, I am now ignoring the value returned by IDXGIDevice1::GetMaximumFrameLatency and instead using a conservative latency of 5 for the allocator. I will also add a loop to block the CPU in case the number of queued frames goes above this value (which really shouldn't happen).

@SoldierOfLight

I will read about the new presentation modes. Thanks!

8 minutes ago, Reitano said:

So far it works on my machine but I should definitely test it on other GPUs to be more confident.

Even if it happens to work on 100 machines that you test on, it may still crash or cause memory corruption on the 101st one... or it may begin to crash/corrupt after the next driver update... 

It's simply luck that it works at all -- apparently your driver is keeping this address range persistently mapped by chance. Even with persistent mapping, though, there's typically some kind of "synchronize" API call that you still use in place of "unmap". It ensures that the CPU's write-combining buffer has been flushed (i.e. that values you've written have actually reached RAM before continuing) and instructs the GPU to invalidate this address range from its caches if it happens to be present. So even assuming that this trick is actually giving you a persistently mapped buffer in D3D11, without these "synchronize" tasks being performed it's still unsafe, and the GPU may consume out-of-date data :(


As an alternative to Map/Unmap, you can also use UpdateSubresource1, as described in this article. That method also saves you from having to manually avoid writing to a buffer that the GPU is currently reading from, which is pretty dodgy to begin with in D3D11 since you don't have explicit submission or fences.
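Roughly, it looks like this (a sketch; this path needs an ID3D11DeviceContext1 and a DEFAULT-usage buffer rather than a DYNAMIC one, and cb/constants are placeholders):

// Full-buffer update; D3D11_COPY_DISCARD tells the driver it may rename the buffer,
// so you never overwrite data the GPU is still reading.
ctx1->UpdateSubresource1(cb, 0, nullptr, &constants, 0, 0, D3D11_COPY_DISCARD);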


Thank you all, you've been very helpful.

@Hodgman

You are so right; I shouldn't even consider code with undefined behavior. I fixed the allocator to always have Map and Unmap calls around a memory write operation. On the API side, client code can use a convenient Upload method to upload small structures like camera data, and manual Map/Unmap methods to upload potentially large chunks of data, like model instances, lights, materials, etc.

You can find the new code at https://codeshare.io/2p7ZbV
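For illustration only (this is not the code behind the link above), the Upload convenience could be as small as this, assuming a Page with a ring-buffer head and 256-byte aligned offsets:

struct Page { ID3D11Buffer* buffer; UINT head; };

template <typename T>
UINT Upload(ID3D11DeviceContext1* ctx, Page& page, const T& value)
{
    const UINT offset = (page.head + 255) & ~255u; // constant buffer offsets must be 256-byte aligned
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (FAILED(ctx->Map(page.buffer, 0, D3D11_MAP_WRITE_NO_OVERWRITE, 0, &mapped)))
        return ~0u;
    memcpy(static_cast<uint8_t*>(mapped.pData) + offset, &value, sizeof(T));
    ctx->Unmap(page.buffer, 0);
    page.head = offset + sizeof(T);
    return offset; // later bound via VS/PSSetConstantBuffers1
}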

I am planning to refactor the rendering engine at a high level. The idea is to upload ALL constants in a first stage, and only at the end bind them and issue the draw calls. This should allow a single Map/Unmap call per constant buffer per frame, as in my original design.

