D3D12 - Mapping and copying all constant buffers in a single operation

Graphics and GPU Programming

Started by kretash November 16, 2015 08:41 PM

9 comments, last by Matias Goldberg 8 years, 5 months ago

kretash

196

Author

November 16, 2015 08:41 PM

Hello,

I'm rendering a scene with a total of 14400 entities. It runs OK, but I saw that the biggest bottleneck was individually copying the constant buffers for every single entity.

One thing I'm doing wrong is updating buffers that will not be rendered because they are outside the frustum. But I went ahead and tried to create what I called an uber buffer, updated every frame with all the matrices inside, to reduce the copies to one call. Then I'll have a single static buffer for every instance that holds an instance_id; this buffer is only written when it is created. It looks something like this.


cbuffer constant_buffer : register ( b0 ) {
  int instance_id;
}

#define MAX_INSTANCES 14428
cbuffer uber_buffer : register ( b1 ) {
  float4x4 mvp[MAX_INSTANCES];
}

This does not work, as the maximum allowed size of a constant buffer is 4096 entries, which I think means 4096 float4s, or 1024 float4x4s.
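The arithmetic behind that limit can be checked at compile time. This is only a sketch: it assumes a float4 is 16 bytes and takes the 4096-entry limit quoted above at face value, and the constant and function names are made up for illustration.

```cpp
#include <cstddef>

// Assumed sizes: a float4 is 4 floats (16 bytes), a float4x4 is 4 float4 rows.
constexpr std::size_t kFloat4Bytes   = 4 * sizeof(float);
constexpr std::size_t kFloat4x4Bytes = 4 * kFloat4Bytes;

// Limit quoted in the post: 4096 float4 entries per constant buffer.
constexpr std::size_t kMaxFloat4PerCBuffer = 4096;

constexpr std::size_t kMaxCBufferBytes       = kMaxFloat4PerCBuffer * kFloat4Bytes;
constexpr std::size_t kMaxMatricesPerCBuffer = kMaxCBufferBytes / kFloat4x4Bytes;

// Number of maximum-size constant buffers needed to hold `instances` matrices.
constexpr std::size_t buffers_needed(std::size_t instances) {
  return (instances + kMaxMatricesPerCBuffer - 1) / kMaxMatricesPerCBuffer;
}

static_assert(kMaxCBufferBytes == 65536, "4096 float4 entries are 64 KB");
static_assert(kMaxMatricesPerCBuffer == 1024, "one float4x4 uses four entries");
```

With MAX_INSTANCES at 14428, `buffers_needed(14428)` comes out at 15, so a single cbuffer declared like the one above cannot hold everything.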

Are there any ways of speeding this up, apart from just not updating the buffers that will not be used?

xycsoscyx

1,167

November 16, 2015 09:46 PM

Here's a good article that covers what you're dealing with:

https://developer.nvidia.com/content/constant-buffers-without-constant-pain-0

The basic idea is to use the discard flag with a smaller buffer (whatever maximum size you want). This effectively lets you use a single buffer, remapping it between draw calls. You create your first batch, filling up the buffer, then draw it. Then you map it again using the discard flag, copy in the next set of instances, and draw that. Rinse and repeat until you have a nice sheen.
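The fill, draw, refill loop boils down to splitting the instance list into ranges that each fit in the buffer. A rough sketch of that bookkeeping in plain C++ follows; `Batch` and `make_batches` are hypothetical names, and the actual map and draw calls are left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A contiguous range of instances that fits in one fill of the buffer.
struct Batch {
  std::size_t first;  // index of the first instance in the batch
  std::size_t count;  // number of instances in the batch
};

// Splits `total` instances into batches of at most `capacity` instances.
// In a renderer, each batch would be: refill the buffer, then draw.
std::vector<Batch> make_batches(std::size_t total, std::size_t capacity) {
  std::vector<Batch> batches;
  if (capacity == 0) return batches;  // nothing fits, avoid an endless loop
  for (std::size_t first = 0; first < total; first += capacity) {
    batches.push_back({first, std::min(capacity, total - first)});
  }
  return batches;
}
```

For 14400 entities and a capacity of 1024 this gives 15 batches, the last one holding the remaining 64 instances.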

kretash

196

Author

November 16, 2015 10:16 PM

Thanks for the reply, that's exactly what I was looking for.

I'm gonna give it a try!

Matias Goldberg

9,637

November 17, 2015 12:52 AM

He's using D3D12, so there is no discard flag.

Basically your problem is that you seem to be mapping too often. You need to reserve a big chunk of memory and then map the whole thing once, then unmap it on shutdown (or earlier if you get memory warnings from D3D12 and can get away with a smaller chunk).

The general idea is similar to that NVIDIA article though (especially the DX11.1 part).
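As a rough illustration of "map once, write many times", the sketch below hands out 256-byte aligned slices of one large block. It is not D3D12 code: a `std::vector` stands in for the pointer that a single `Map` call would return, `MappedArena` is a hypothetical name, and GPU synchronization (fences, frames in flight) is deliberately left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in for a persistently mapped upload buffer. In D3D12 the pointer would
// come from one Map() call at startup; here a std::vector plays that role.
class MappedArena {
 public:
  static constexpr std::size_t kFull = static_cast<std::size_t>(-1);

  explicit MappedArena(std::size_t bytes) : storage_(bytes), head_(0) {}

  // Copies `bytes` from `src` into the arena at a 256-byte aligned offset.
  // Returns that offset, or kFull if the data does not fit.
  std::size_t write(const void* src, std::size_t bytes) {
    const std::size_t offset = (head_ + 255) & ~static_cast<std::size_t>(255);
    if (offset > storage_.size() || bytes > storage_.size() - offset) return kFull;
    std::memcpy(storage_.data() + offset, src, bytes);
    head_ = offset + bytes;
    return offset;
  }

  // Called once per frame, after the GPU has finished reading the old data.
  void reset() { head_ = 0; }

  const std::uint8_t* data() const { return storage_.data(); }

 private:
  std::vector<std::uint8_t> storage_;
  std::size_t head_;
};
```

Each offset returned by `write` is what would be added to the buffer's GPU virtual address when binding that slice.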

Twitter: @matiasgoldberg

Distant Souls | Alliance AirWar | My Free Royalty-Free Music Library

kretash

196

Author

November 17, 2015 03:11 PM

OK, I just figured out how to do it in a single call. I'm not sure if this is the correct way.

The first step is to create a single buffer that has the size of the struct in the shader times the instances that will be accessing it.


    HRESULT result;

    // Upload heap: CPU-writable memory that the GPU can read directly.
    CD3DX12_HEAP_PROPERTIES heapProperties = CD3DX12_HEAP_PROPERTIES( D3D12_HEAP_TYPE_UPLOAD );
    // One uber_buffer struct per drawable, all inside a single buffer resource.
    CD3DX12_RESOURCE_DESC resourceDesc = CD3DX12_RESOURCE_DESC::Buffer( 
      sizeof( uber_buffer ) * k_engine->get_total_drawables() );

    // Upload heap resources have to be created in the GENERIC_READ state.
    result = k_engine->get_device()->CreateCommittedResource(
      &heapProperties,
      D3D12_HEAP_FLAG_NONE,
      &resourceDesc,
      D3D12_RESOURCE_STATE_GENERIC_READ,
      nullptr,
      IID_PPV_ARGS( &r->m_uber_buffer ) );
    assert( SUCCEEDED( result ) && "CREATING THE CONSTANT BUFFER FAILED" );
    r->m_uber_buffer->SetName( L"UBER BUFFER" );

Then, when creating the view for the constant buffer, I set a "buffer_offset" equal to the number of bytes between the start of the buffer and the element that I want to access in the shader. "buffer_size" is the size of a single struct. I then store the view in the descriptor heap.


    const UINT buffer_size = ( sizeof( uber_buffer ) + 255 ) & ~255;
    r->m_uber_buffer_desc = {};
    D3D12_GPU_VIRTUAL_ADDRESS addr = r->m_uber_buffer->GetGPUVirtualAddress();
    r->m_uber_buffer_desc.BufferLocation = addr + buffer_offset;
    r->m_uber_buffer_desc.SizeInBytes = buffer_size;

    CD3DX12_CPU_DESCRIPTOR_HANDLE cbvSrvHandle(
      r->m_cbv_srv_heap->GetCPUDescriptorHandleForHeapStart(),
      offset,
      r->m_cbv_srv_descriptor_size );

    k_engine->get_device()->CreateConstantBufferView( &r->m_uber_buffer_desc, cbvSrvHandle );
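The offset and size arithmetic used for the view can be isolated into two small helpers. This is a sketch in plain C++ with no D3D12 types; `align_cbv` and `instance_address` are hypothetical names, and the 256-byte value matches D3D12_CONSTANT_BUFFER_DATA_PLACEMENT_ALIGNMENT.

```cpp
#include <cstdint>

// Constant buffer views must start at, and be sized in, multiples of 256 bytes.
constexpr std::uint64_t kCbvAlignment = 256;

// Rounds `bytes` up to the next multiple of 256. `a + 255 & ~255` parses the
// same way, but the explicit parentheses are harder to misread.
constexpr std::uint64_t align_cbv(std::uint64_t bytes) {
  return (bytes + (kCbvAlignment - 1)) & ~(kCbvAlignment - 1);
}

// GPU address of the slice that belongs to instance `index`, given the address
// of the start of the buffer and the unaligned size of one struct.
constexpr std::uint64_t instance_address(std::uint64_t base,
                                         std::uint64_t struct_bytes,
                                         std::uint64_t index) {
  return base + index * align_cbv(struct_bytes);
}

static_assert(align_cbv(256) == 256, "already aligned sizes are unchanged");
static_assert(align_cbv(128) == 256, "smaller structs are rounded up");
```

Note that if the struct were smaller than 256 bytes, the CPU-side array would also need a 256-byte stride for a single memcpy of the whole array to line up with these offsets.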

I then Map the buffer and copy the whole array of constant buffers into it.


HRESULT result = r->m_uber_buffer->Map( 0, nullptr, reinterpret_cast< void** >( &r->m_uber_buffer_WO ) );
assert( result == S_OK && "MAPPING THE CONSTANT BUFFER FAILED" );
memcpy( r->m_uber_buffer_WO, ub, sizeof( uber_buffer ) * k_engine->get_total_drawables() );
r->m_uber_buffer->Unmap( 0, nullptr );

Then I have the constant buffer defined in the shader. Its size has to be a multiple of 256 bytes.


cbuffer uber_buffer : register ( b0 ) {
  float4x4 mvp;
  float4x4 model;
  float4x4 view;
  float4x4 projection;
}

The only step missing is binding the cbv in the render function.

The number of calls gets reduced from 14000 to 1. But there has only been a slight improvement in performance. I'm going to clean up the code and see if I can find out why.

Matias Goldberg

9,637

November 18, 2015 01:28 AM

You shouldn't be calling SetGraphicsRootDescriptorTable so often, nor IASetVertex/IndexBuffers either.

Don't create one vertex/index buffer per mesh. Create just one and bucket them in the same buffer at different offsets.

Your shaders should be indexing the individual data via baseInstance/drawID, in order to minimize SetGraphicsRootDescriptorTable calls.

And you should definitely not put the mvp matrix (which is per object) in the same buffer as the view and projection matrices (which are per camera pass).

The way you're setting up your renderer is how DX9 did things, and that's not going to run fast for 10k+ draws.

kretash

196

Author

November 18, 2015 09:37 AM

I'll look into that. The reason I have the view and the projection matrices in the buffer is because it has to be a multiple of 256. They are not even being used in the shader, they are just padding.
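If padding is the only goal, one option on the CPU side is to let the compiler add it instead of carrying unused matrices. A minimal sketch, assuming a C++11 compiler that supports 256-byte alignment; `PerObjectConstants` and its fields are illustrative names, not the struct from the post.

```cpp
#include <cstddef>

// CPU-side mirror of the per-object data, padded by the compiler to 256 bytes
// rather than by unused view and projection matrices.
struct alignas(256) PerObjectConstants {
  float mvp[16];    // 64 bytes
  float model[16];  // 64 bytes
  // The remaining 128 bytes are padding introduced by alignas(256).
};

static_assert(sizeof(PerObjectConstants) == 256, "one 256-byte slice per object");
static_assert(alignof(PerObjectConstants) == 256, "array elements keep the stride");
```

An array of these structs has a 256-byte stride, so it can still be copied into the mapped buffer with a single memcpy.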

Matias Goldberg

9,637

November 18, 2015 05:19 PM

That's why you need to index in the shader. You should be using a const buffer like this one:

struct InstanceData
{
  float4x4 mvp;
  float4x4 model;
};

cbuffer uber_buffer : register ( b0 ) {
  InstanceData i[512];
}; //uber_buffer is now 65536 bytes in size. Don't exceed this number for performance reasons with NVIDIA and Intel cards.

Index it in the shader. You only need to call SetGraphicsRootDescriptorTable every 512 drawn models (1024 if you get rid of the mvp and only send the world matrix); or ignore the 64 KB limit and call SetGraphicsRootDescriptorTable even less frequently (trading a small GPU performance hit for a huge CPU performance gain; it depends on what your bottleneck is).
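The CPU-side bookkeeping for that scheme could look roughly like the sketch below. It assumes the 512-element layout shown above; `InstanceSlot`, `slot_for_draw` and `binds_needed` are hypothetical names, and how the index reaches the shader is left open.

```cpp
#include <cstddef>

// CPU-side mirror of InstanceData from the post: two float4x4 matrices.
struct InstanceData {
  float mvp[16];
  float model[16];
};

constexpr std::size_t kInstancesPerBuffer = 512;
static_assert(sizeof(InstanceData) * kInstancesPerBuffer == 65536,
              "512 instances of 128 bytes fill exactly 64 KB");

// Where a given draw finds its data.
struct InstanceSlot {
  std::size_t buffer;  // which 64 KB block to bind (one descriptor table bind each)
  std::size_t index;   // which element of i[] the shader should read
};

constexpr InstanceSlot slot_for_draw(std::size_t draw) {
  return {draw / kInstancesPerBuffer, draw % kInstancesPerBuffer};
}

// Number of SetGraphicsRootDescriptorTable calls needed for `draws` objects.
constexpr std::size_t binds_needed(std::size_t draws) {
  return (draws + kInstancesPerBuffer - 1) / kInstancesPerBuffer;
}
```

For the 14400 entities in this thread that works out to 29 binds per pass instead of one per draw.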

kretash

196

Author

November 20, 2015 11:40 AM

That looks a lot better, gonna give it a try. Thanks!

WFP

2,787

November 21, 2015 09:00 PM

Hey Matias, quick question: when you use a constant buffer the way you described, do you just index into that buffer with a root constant that gets set per draw (SetGraphicsRoot32BitConstant)? That's the way that immediately comes to mind, but I may be missing something else obvious. Of course it'd follow the recommendation of being the first element in the root signature, sorting by most to least frequently updated.
