D3D12 - Mapping and copying all constant buffers in a single operation

9 comments, last by Matias Goldberg 8 years, 5 months ago

Hello,

I'm rendering a scene with a total of 14400 entities. It runs OK, but I saw that the biggest bottleneck was individually copying the constant buffer of every single entity.

[Attached image: untitled.png]

One thing I'm doing wrong is updating buffers that will not be rendered because they are outside the frustum. But I went ahead and tried to create what I called an uber buffer, which is updated every frame and holds all the matrices, to reduce the copies to one call. Then I'll have a single static buffer for every instance that holds an instance_id; this buffer is only written once, when it is created. It looks something like this.


cbuffer constant_buffer : register ( b0 ) {
  int instance_id;
}

#define MAX_INSTANCES 14428
cbuffer uber_buffer : register ( b1 ) {
  float4x4 mvp[MAX_INSTANCES];
}

This does not work because the maximum allowed size of a constant buffer is 4096 entries, which I think means 4096 float4s, or 1024 float4x4s.
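
The arithmetic behind that limit can be sketched in a few lines of C++. This is only an illustration: the constant and function names are made up for the example, and it assumes the limit is 4096 16-byte vectors (64 KiB) per constant buffer.

```cpp
#include <cstddef>

// Assumed limit: a constant buffer holds at most 4096 16-byte vectors (float4).
constexpr std::size_t kMaxCBufferVectors = 4096;
constexpr std::size_t kBytesPerVector = 16;                    // one float4
constexpr std::size_t kBytesPerMatrix = 4 * kBytesPerVector;   // one float4x4
constexpr std::size_t kMaxCBufferBytes = kMaxCBufferVectors * kBytesPerVector;

// How many float4x4 matrices fit in one constant buffer.
constexpr std::size_t max_matrices_per_cbuffer() {
  return kMaxCBufferBytes / kBytesPerMatrix;
}

// How many constant buffers are needed to hold `instances` matrices (rounded up).
constexpr std::size_t cbuffers_needed(std::size_t instances) {
  return (instances + max_matrices_per_cbuffer() - 1) / max_matrices_per_cbuffer();
}

static_assert(kMaxCBufferBytes == 65536, "4096 float4s are 64 KiB");
static_assert(max_matrices_per_cbuffer() == 1024, "1024 float4x4s per buffer");
```

With these numbers, the 14428-entry array from the shader above would have to be split across 15 constant buffers.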

Are there any ways of speeding this up, apart from just not updating the buffers that will not be used?


Here's a good article that covers what you're dealing with:

https://developer.nvidia.com/content/constant-buffers-without-constant-pain-0

The basic idea is to use the discard flag with a smaller buffer (whatever maximum size you want). This effectively lets you use a single buffer and remap it between draw calls. You create your first batch, filling up the buffer, then draw it. Then you map it again using the discard flag, copy in the next set of instances, and draw that. Rinse and repeat until you have a nice sheen.

Thanks for the reply, that's exactly what I was looking for.

I'm gonna give it a try!

He's using D3D12, so there is no discard flag.

Basically your problem is that you seem to be mapping too often. You need to reserve a big chunk of memory and then map the whole thing once, then unmap it on shutdown (or earlier if you get memory warnings from D3D12 and can get away with a smaller chunk).

The general idea is similar to that NVIDIA article though (especially the DX11.1 part).
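
A rough sketch of the "map once, then sub-allocate" idea, in plain C++ so it runs without a GPU. Everything here is hypothetical: `PersistentUploadRing` is an invented name, and a `std::vector` stands in for the pointer that a single `Map` call on an upload-heap buffer would return. Real code would also have to wait on a fence before wrapping around, so it never overwrites memory the GPU is still reading.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an upload buffer that stays mapped for its lifetime.
class PersistentUploadRing {
public:
  explicit PersistentUploadRing(std::size_t capacity) : storage_(capacity) {}

  // Copies `size` bytes into the mapped memory and returns the byte offset of
  // the copy. Offsets are kept 256-byte aligned so they can be used for CBVs.
  std::size_t write(const void* data, std::size_t size) {
    const std::size_t aligned = (size + 255) & ~static_cast<std::size_t>(255);
    assert(aligned <= storage_.size() && "allocation larger than the buffer");
    if (head_ + aligned > storage_.size()) {
      head_ = 0;  // wrap around; real code must sync with the GPU first
    }
    std::memcpy(storage_.data() + head_, data, size);
    const std::size_t offset = head_;
    head_ += aligned;
    return offset;
  }

  // The "mapped" pointer; obtained once and reused every frame.
  const std::uint8_t* mapped() const { return storage_.data(); }

private:
  std::vector<std::uint8_t> storage_;
  std::size_t head_ = 0;
};
```

Each `write` is just a `memcpy` plus a pointer bump, so the per-entity cost of mapping and unmapping disappears.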

OK, I just figured out how to do it in a single call. I'm not sure if this is the correct way.

The first step is to create a single buffer that has the size of the struct in the shader times the instances that will be accessing it.


    HRESULT result;

    CD3DX12_HEAP_PROPERTIES heapProperties = CD3DX12_HEAP_PROPERTIES( D3D12_HEAP_TYPE_UPLOAD );
    CD3DX12_RESOURCE_DESC resourceDesc = CD3DX12_RESOURCE_DESC::Buffer( 
      sizeof( uber_buffer ) * k_engine->get_total_drawables() );

    result = k_engine->get_device()->CreateCommittedResource(
      &heapProperties,
      D3D12_HEAP_FLAG_NONE,
      &resourceDesc,
      D3D12_RESOURCE_STATE_GENERIC_READ,
      nullptr,
      IID_PPV_ARGS( &r->m_uber_buffer ) );
    assert( result == S_OK && "CREATING THE CONSTANT BUFFER FAILED" );
    r->m_uber_buffer->SetName( L"UBER BUFFER" );

Then, when creating the view for the constant buffer, I set a "buffer_offset" equal to the distance in bytes between the start of the buffer and the element that I want to access in the shader. "buffer_size" is the size of a single struct. I then store that view in the descriptor heap.


    // Round the struct size up to the 256-byte alignment that CBVs require.
    const UINT buffer_size = ( sizeof( uber_buffer ) + 255 ) & ~255;
    r->m_uber_buffer_desc = {};
    D3D12_GPU_VIRTUAL_ADDRESS addr = r->m_uber_buffer->GetGPUVirtualAddress();
    r->m_uber_buffer_desc.BufferLocation = addr + buffer_offset;
    r->m_uber_buffer_desc.SizeInBytes = buffer_size;

    CD3DX12_CPU_DESCRIPTOR_HANDLE cbvSrvHandle(
      r->m_cbv_srv_heap->GetCPUDescriptorHandleForHeapStart(),
      offset,
      r->m_cbv_srv_descriptor_size );

    k_engine->get_device()->CreateConstantBufferView( &r->m_uber_buffer_desc, cbvSrvHandle );
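
The offset and size used for each view above can be computed with a small helper. This is a sketch under assumptions: `align_up`, `CbvRange` and `cbv_range_for_instance` are names invented for the example, and the 256-byte value mirrors D3D12_CONSTANT_BUFFER_DATA_PLACEMENT_ALIGNMENT.

```cpp
#include <cstdint>

// Rounds `size` up to the next multiple of `alignment` (a power of two).
constexpr std::uint64_t align_up(std::uint64_t size, std::uint64_t alignment) {
  return (size + alignment - 1) & ~(alignment - 1);
}

// CBV placement alignment in bytes.
constexpr std::uint64_t kCbvAlignment = 256;

// Hypothetical pair of values to put in a constant buffer view description.
struct CbvRange {
  std::uint64_t location;    // value for BufferLocation
  std::uint32_t size_bytes;  // value for SizeInBytes
};

// Computes the view range for one instance inside a big shared buffer.
inline CbvRange cbv_range_for_instance(std::uint64_t base_gpu_address,
                                       std::uint32_t instance_index,
                                       std::uint32_t struct_size) {
  const std::uint64_t stride = align_up(struct_size, kCbvAlignment);
  CbvRange range;
  range.location = base_gpu_address + instance_index * stride;
  range.size_bytes = static_cast<std::uint32_t>(stride);
  return range;
}
```

Note that the stride is the aligned size, so if the struct were not already a multiple of 256 bytes, the buffer itself would need to be allocated as the aligned stride times the instance count.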

I then map the whole buffer and copy the entire array of per-instance constants into it with a single memcpy.


HRESULT result = r->m_uber_buffer->Map( 0, nullptr, reinterpret_cast< void** >( &r->m_uber_buffer_WO ) );
assert( result == S_OK && "MAPPING THE CONSTANT BUFFER FAILED" );
memcpy( r->m_uber_buffer_WO, ub, sizeof( uber_buffer ) * k_engine->get_total_drawables() );
r->m_uber_buffer->Unmap( 0, nullptr );

Then I have the constant buffer defined in the shader. Its size has to be a multiple of 256 bytes.


cbuffer uber_buffer : register ( b0 ) {
  float4x4 mvp;
  float4x4 model;
  float4x4 view;
  float4x4 projection;
}

The only step missing is binding the CBV in the render function.

[Attached image: QImVzfq.png]

The number of calls gets reduced from 14000 to 1. But there has only been a slight improvement in performance, so I'm gonna clean up the code and see if I can find out why.

You shouldn't be calling SetGraphicsRootDescriptorTable so often, nor IASetVertexBuffers/IASetIndexBuffer either.

Don't create one vertex/index buffer per mesh. Create just one and bucket them in the same buffer at different offsets.
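
A minimal sketch of that bucketing on the CPU side, assuming a made-up `Vertex` layout and hypothetical `MeshPool`/`MeshRange` names. The three values recorded per mesh are the ones a `DrawIndexedInstanced` call takes as BaseVertexLocation, StartIndexLocation and IndexCountPerInstance.

```cpp
#include <cstdint>
#include <vector>

// Assumed vertex layout, only for illustration.
struct Vertex {
  float position[3];
  float normal[3];
};

// Where one mesh lives inside the shared vertex/index buffers.
struct MeshRange {
  std::uint32_t base_vertex;  // added to every index when drawing
  std::uint32_t start_index;  // first index of this mesh in the shared buffer
  std::uint32_t index_count;  // number of indices to draw
};

// Hypothetical pool that appends every mesh to one vertex and one index array.
class MeshPool {
public:
  MeshRange add(const std::vector<Vertex>& vertices,
                const std::vector<std::uint32_t>& indices) {
    MeshRange range;
    range.base_vertex = static_cast<std::uint32_t>(vertices_.size());
    range.start_index = static_cast<std::uint32_t>(indices_.size());
    range.index_count = static_cast<std::uint32_t>(indices.size());
    // Indices stay mesh-local; base_vertex rebases them at draw time.
    vertices_.insert(vertices_.end(), vertices.begin(), vertices.end());
    indices_.insert(indices_.end(), indices.begin(), indices.end());
    return range;
  }

  const std::vector<Vertex>& vertices() const { return vertices_; }
  const std::vector<std::uint32_t>& indices() const { return indices_; }

private:
  std::vector<Vertex> vertices_;
  std::vector<std::uint32_t> indices_;
};
```

The two arrays would then be uploaded once into a single vertex buffer and a single index buffer, bound once, and each draw would only pass its `MeshRange` values.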

Your shaders should be indexing the individual data via baseInstance/drawID, in order to minimize SetGraphicsRootDescriptorTable calls.

And you should definitely not put the mvp matrix (which is per object) in the same buffer as the view and projection matrices (which are per camera pass).

The way you're setting up your renderer is how DX9 did things, and that's not going to run fast for 10k+ draws.

I'll look into that. The reason I have the view and the projection matrices in the buffer is that it has to be a multiple of 256 bytes. They are not even being used in the shader, they are just padding.

That's why you need to index in the shader. You should be using a const buffer like this one:

struct InstanceData
{
  float4x4 mvp;
  float4x4 model;
};

cbuffer uber_buffer : register ( b0 ) {
  InstanceData i[512];
}; //uber_buffer is now 65536 bytes in size. Don't exceed this number for performance reasons with NVIDIA and Intel cards.

Index it in the shader. You only need to call SetGraphicsRootDescriptorTable every 512 drawn models (1024 if you get rid of the mvp and only send the world matrix); or ignore the 64 KB limit and call SetGraphicsRootDescriptorTable even less frequently (trading a small GPU performance hit for a huge CPU performance gain; it depends on what your bottleneck is).

That looks a lot better, gonna give it a try. Thanks!
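
The batching described above (512 instances per 64 KB window) comes down to a division and a remainder per draw. A small CPU-side sketch, where `BatchSlot`, `slot_for_draw` and `binds_needed` are names invented for the example and `InstanceData` mirrors the shader struct as two 4x4 float matrices:

```cpp
#include <cstdint>

// CPU mirror of the shader's InstanceData: two float4x4 matrices, 128 bytes.
struct InstanceData {
  float mvp[16];
  float model[16];
};

constexpr std::uint32_t kInstancesPerBatch = 512;

static_assert(sizeof(InstanceData) == 128, "two 4x4 float matrices");
static_assert(sizeof(InstanceData) * kInstancesPerBatch == 65536,
              "one batch fills exactly 64 KiB");

// Which 64 KiB window a draw uses, and where its data sits inside it.
struct BatchSlot {
  std::uint32_t batch;        // window to bind before drawing
  std::uint32_t index;        // index into i[] inside the shader
  std::uint64_t byte_offset;  // start of the window in the big buffer
};

inline BatchSlot slot_for_draw(std::uint32_t draw_id) {
  BatchSlot slot;
  slot.batch = draw_id / kInstancesPerBatch;
  slot.index = draw_id % kInstancesPerBatch;
  slot.byte_offset = static_cast<std::uint64_t>(slot.batch) *
                     kInstancesPerBatch * sizeof(InstanceData);
  return slot;
}

// Number of times the constant buffer window has to be rebound for `draws` draws.
constexpr std::uint32_t binds_needed(std::uint32_t draws) {
  return (draws + kInstancesPerBatch - 1) / kInstancesPerBatch;
}
```

For the 14400 entities in this thread that works out to 29 binds instead of 14400, and every window offset is a multiple of 65536 bytes, so it also satisfies the 256-byte CBV alignment.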

Hey Matias, quick question: when you use a constant buffer the way you described, do you just index into that buffer with a root constant that gets set per draw (SetGraphicsRoot32BitConstant)? That's the way that immediately comes to mind, but I may be missing something else obvious. Of course it'd follow the recommendation of being the first element in the root signature, sorting from most to least frequently updated.
