DX11 - Need to lock and unlock constant buffer every time you update it?


Started by mister345 December 01, 2017 04:26 AM

7 comments, last by SoldierOfLight 6 years, 4 months ago

Author

December 01, 2017 04:26 AM

While considering how to optimize my DirectX11 graphics engine, I noticed that it is mapping and unmapping (locking and unlocking) the D3D11_MAPPED_SUBRESOURCE many times to write to different constant buffers. Some shaders have 10 or more constant buffers, for camera position, light direction, clip plane, texture translation, fog info, and many other things that need to be passed from the CPU to the GPU.

I was wondering if all the mapping and unmapping might be the reason why my engine is running horribly slow, and is there any way around this? What is the correct way to do it?

(Refer to LightShaderClass::SetShaderParameters() function, line 401 onward to see all the mapping/unmapping).

https://github.com/mister51213/DirectX11Engine/blob/WaterShader/DirectX11Engine/LightShaderClass.cpp

I feel like I might be doing something obviously wrong and wasteful that could be fixed with a simple reorganization, but I don't know enough about DX11 to know how. Any tips would be much appreciated, thanks.

Hodgman

52,717

December 01, 2017 04:47 AM

8 minutes ago, mister345 said:

I was wondering if all the mapping and unmapping might be the reason why my engine is running horribly slow

Don't wonder. Measure. You cannot optimize without being able to actually collect stats on your execution times. Add some timers to your code and measure where your CPU time is being spent.

Your code is pretty standard. I map over 100 cbuffers per frame in my game sometimes and still manage to run at 200fps, so it's not something to be too afraid of. Are you calling this once per frame, or once for each object? You could try using fewer cbuffers -- e.g. you could merge all those variables (camera, clipping, lights) into a single cbuffer.
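As a rough sketch of what that merge might look like: the struct below folds camera, light, clip plane and fog values into one CPU-side block. The struct name, field names and the matching HLSL declaration in the comment are hypothetical, not taken from the linked repository; the only rule assumed is HLSL's packing of cbuffer members into 16-byte registers.

```cpp
#include <cstddef>

// Hypothetical merged per-frame constants. A matching HLSL declaration
// (also hypothetical) would be:
//
//   cbuffer PerFrame : register(b0) {
//       float3 cameraPosition; float pad0;
//       float3 lightDirection; float pad1;
//       float4 clipPlane;
//       float  fogStart; float fogEnd; float2 pad2;
//   };
struct PerFrameConstants {
    float cameraPosition[3];
    float pad0;               // fills the 4th component of register 0
    float lightDirection[3];
    float pad1;               // fills the 4th component of register 1
    float clipPlane[4];       // register 2
    float fogStart;           // register 3
    float fogEnd;
    float pad2[2];            // rounds the struct up to a whole register
};

// Constant buffer sizes must be a multiple of 16 bytes.
static_assert(sizeof(PerFrameConstants) % 16 == 0, "size must be a multiple of 16 bytes");
static_assert(offsetof(PerFrameConstants, clipPlane) == 32, "clipPlane should start register 2");
```

With a layout like this, one Map/Unmap per frame would replace four separate ones. Whether that matters is still something to measure.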

There's one thing to watch out for, mentioned in the docs:

Quote
Don't read from a subresource mapped for writing

When you pass D3D11_MAP_WRITE, D3D11_MAP_WRITE_DISCARD, or D3D11_MAP_WRITE_NO_OVERWRITE to the MapType parameter, you must ensure that your app does not read the subresource data to which the pData member of D3D11_MAPPED_SUBRESOURCE points because doing so can cause a significant performance penalty.
Note
Even the following C++ code can read from memory and trigger the performance penalty because the code can expand to the following x86 assembly code.

C++ code:
*((int*)MappedResource.pData) = 0;
x86 assembly code:
AND DWORD PTR [EAX],0

To avoid this, the safest thing to do is to write into your own temporary structure on the stack, and then memcpy it into the mapped data.
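A minimal sketch of that pattern, assuming a hypothetical `CameraConstants` layout and a hypothetical helper name `UploadConstants`. The `mappedData` parameter stands in for the `pData` member of `D3D11_MAPPED_SUBRESOURCE`, so the snippet compiles without the D3D headers:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical constant layout; the field names are illustrative only.
struct CameraConstants {
    float cameraPosition[3];
    float padding;  // keeps the struct a multiple of 16 bytes
};
static_assert(sizeof(CameraConstants) % 16 == 0, "cbuffer sizes are multiples of 16 bytes");

// Copies a fully built CPU-side struct into write-only mapped memory.
// All field-by-field work happens on `source` (ordinary stack memory);
// the mapped pointer is only ever the destination of a single memcpy.
template <typename T>
void UploadConstants(void* mappedData, const T& source) {
    static_assert(std::is_trivially_copyable<T>::value, "constants must be plain data");
    assert(mappedData != nullptr);
    std::memcpy(mappedData, &source, sizeof(T));
}
```

In real code this would sit between a successful `Map(..., D3D11_MAP_WRITE_DISCARD, ...)` and the matching `Unmap`, called as `UploadConstants(mapped.pData, localCopy)`.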

. 22 Racing Series .

mister345

Author

December 01, 2017 05:31 AM

44 minutes ago, Hodgman said:

Don't wonder. Measure. You cannot optimize without being able to actually collect stats on your execution times. Add some timers to your code and measure where your CPU time is being spent.

Your code is pretty standard. I map over 100 cbuffers per frame in my game sometimes and still manage to run at 200fps, so it's not something to be too afraid of. Are you calling this once per frame, or once for each object? You could try using fewer cbuffers -- e.g. you could merge all those variables (camera, clipping, lights) into a single cbuffer.

There's one thing to watch out for, mentioned in the docs:

To avoid this, the safest thing to do is to write into your own temporary structure on the stack, and then memcpy it into the mapped data.

Thank you so much for your insights. I intend to write a proper profiler, but since time is extremely limited, might you give me some tips on low hanging fruit that would yield quick performance increases?

Currently, I'm not using SIMD (XMVECTOR... functions/types) but rather XMFLOAT3s and 4s and XMLoadFloat.... functions for everything. If I can only choose one to optimize (restructuring all the cbuffers into one big cbuffer, or changing all the XMFLOAT...s into XMVECTORs), which will make a bigger performance difference? Thanks again.

turanszkij

545

December 01, 2017 08:43 AM

3 hours ago, mister345 said:

Thank you so much for your insights. I intend to write a proper profiler, but since time is extremely limited, might you give me some tips on low hanging fruit that would yield quick performance increases?

Currently, I'm not using SIMD (XMVECTOR... functions/types) but rather XMFLOAT3s and 4s and XMLoadFloat.... functions for everything. If I can only choose one to optimize (restructuring all the cbuffers into one big cbuffer, or changing all the XMFLOAT...s into XMVECTORs), which will make a bigger performance difference? Thanks again.

Unfortunately there is no low-hanging fruit here; you have to optimize for your specific case, which we don't know enough about. Neither option is really straightforward, though; you'll probably have to restructure your engine for both. Converting to SIMD XMVECTOR also might not actually be faster for small operations, because you'll have additional Load/Store instructions in your code for them.

3 hours ago, Hodgman said:

I map over 100 cbuffers per frame in my game sometimes and still manage to run at 200fps, so it's not something to be too afraid of.

Mapping 100s of cbuffers seems like a rather low number for an actual game, even when they are grouped properly like you mention. In the renderers I've been working on, there are still some per-drawcall cbuffer updates happening most of the time, so it is more like 1000s of cbuffer updates a frame. Are you talking about your 22 Racing Series? I can't imagine that game getting away with so few updates, but correct me if I'm wrong. (PS the link in your signature doesn't work)

Wicked Engine

mister345

Author

December 01, 2017 08:51 AM

7 minutes ago, turanszkij said:

Unfortunately there is no low-hanging fruit here; you have to optimize for your specific case, which we don't know enough about. Neither option is really straightforward, though; you'll probably have to restructure your engine for both. Converting to SIMD XMVECTOR also might not actually be faster for small operations, because you'll have additional Load/Store instructions in your code for them.

Mapping 100s of cbuffers seems like a rather low number for an actual game, even when they are grouped properly like you mention. In the renderers I've been working on, there are still some per-drawcall cbuffer updates happening most of the time, so it is more like 1000s of cbuffer updates a frame. Are you talking about your 22 Racing Series? I can't imagine that game getting away with so few updates, but correct me if I'm wrong. (PS the link in your signature doesn't work)

Oh I see. So I installed DirectXTK using the NuGet Package Manager, but it's giving me this weird "mismatch" error, like it's referring to the wrong version or something. Any ideas about this?

LNK2038 mismatch detected for '_ITERATOR_DEBUG_LEVEL': value '2' doesn't match value '0' in graphicsclass.obj Engine C:\Users\n\Desktop\RastertekResources\dx11tut29\dx11tut29\Engine\Engine\DirectXTK.lib(DDSTextureLoader.obj) 1

Hodgman

52,717

December 01, 2017 12:33 PM

3 hours ago, turanszkij said:

Mapping 100s of cbuffers seems like a rather low number for an actual game, even when they are grouped properly like you mention. In the renderers I've been working on, there are still some per-drawcall cbuffer updates happening most of the time, so it is more like 1000s of cbuffer updates a frame. Are you talking about your 22 Racing Series? I can't imagine that game getting away with so few updates, but correct me if I'm wrong. (PS the link in your signature doesn't work)

(Huh, I don't even see my signature on posts at the moment! thanks!)
Here's an example - 4 player split screen is a stress for the renderer (and IIRC, each viewport has two shadow cascades, so 12 scene traversals total):

And here's a profile of that situation: https://i.imgur.com/wRzQEol.png

IMHO these stats should be much lower, as we're not currently using any instancing, and have not optimized the content for draw-counts or LOD'ing (e.g. the vehicle alone is around 50 draw-items, due to overly complex materials and a node hierarchy, where really it could probably be as low as 2 draw-items). The stats in the screenshot are:


drawCount - 2061 - calls to DrawIndexed
texturesCount - 880 - calls to *SSetShaderResources
constantsCount - 717 - calls to *SSetConstantBuffers
samplersCount - 278 - calls to *SSetSamplers
renderTargetsCount - 38 - calls to OMSetRenderTargets
vertexBuffersCount - 220 - calls to IASetVertexBuffers
indexBufferCount - 107 - calls to IASetIndexBuffer
shaderCount - 691 - calls to *SSetShader - this feels a bit high, perhaps my sorting isn't working as well as I thought?
cBufferUpdateCount - 175 - calls to Map/Unmap (Discard) on a constant buffer
inputAssemblerCount - 6 - calls to IASetInputLayout
rasterCount - 31 - calls to RSSetState or RSSetScissorRects
depthTestCount - 21 - calls to OMSetDepthStencilState
blendCount - 324 - calls to OMSetBlendState - this is suspiciously high, I think I just found a bug :o

For static objects, we pre-create their per-draw cbuffers, as they don't change every frame (we use a view-proj matrix in the camera cbuffer, and a world matrix in the per-draw cbuffer, not a pre-combined world-view-proj matrix). For dynamic objects, their per-draw cbuffer usually belongs to the transformation node that they're attached to, so if there's multiple materials on one model node, they will all share a single cbuffer update. In a different renderer design, you could also use structured buffers for a lot of this data, instead of constant buffers.

In D3D11.1, there are also new cbuffer binding mechanisms that let you allocate multiple cbuffers within a larger buffer object at sub-offsets (like in GL4/D3D12/Vulkan), which allows you to perform the same number of cbuffer updates with fewer calls to Map/Unmap... However, we don't take advantage of these yet either.
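To give an idea of the bookkeeping involved, here is a sketch of the offset arithmetic only. It assumes the documented rules for the D3D11.1 `*SSetConstantBuffers1` calls: offsets and counts are measured in 16-byte shader constants, and both must be multiples of 16 constants (256 bytes). The names `ConstantRange`, `AlignUp` and `MakeRange` are hypothetical helpers, not part of any API.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kBytesPerConstant = 16;   // one float4 shader constant
constexpr std::uint32_t kBytesPerBindStep = 256;  // 16 constants: granularity of offsets and counts

// Rounds `value` up to the next multiple of `alignment`.
constexpr std::uint32_t AlignUp(std::uint32_t value, std::uint32_t alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

// Values in the units the binding call expects (shader constants, not bytes).
struct ConstantRange {
    std::uint32_t firstConstant;
    std::uint32_t numConstants;
};

// `byteOffset` is where a sub-allocation starts inside the big buffer and is
// assumed to be 256-byte aligned already (e.g. produced via AlignUp).
// `byteSize` is the size of the CPU-side struct stored there.
constexpr ConstantRange MakeRange(std::uint32_t byteOffset, std::uint32_t byteSize) {
    return ConstantRange{byteOffset / kBytesPerConstant,
                         AlignUp(byteSize, kBytesPerBindStep) / kBytesPerConstant};
}
```

The two fields would then go into the `pFirstConstant` and `pNumConstants` arrays of `ID3D11DeviceContext1::VSSetConstantBuffers1` (or the equivalent for another shader stage), after writing all sub-allocations under a single Map/Unmap pair.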

7 hours ago, mister345 said:

I intend to write a proper profiler, but since time is extremely limited, might you give me some tips on low hanging fruit that would yield quick performance increases?

The things you mentioned might save you 10ms per frame, or 10us per frame. If it's the latter, then it's a complete waste of time even thinking about them! If time is limited, I'd advise getting a torch ASAP instead of swinging punches in the dark.

You don't need a full profiling system. You've likely already got a timer if you've got a game loop, so read the timer before/after a block of code and printf what the difference is. If a major system is only taking a millisecond, then it's probably not your problem. If a minor system is taking 10ms, then add more timer printfs until you find the culprit in it.
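One way to do the timer-and-printf approach with only the standard library is a small scope timer. This is just a sketch; `ScopedTimer` is a hypothetical helper name, and `std::chrono::steady_clock` is used here instead of whatever timer your game loop already has.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

// Prints how long the enclosing scope took when it goes out of scope.
class ScopedTimer {
public:
    explicit ScopedTimer(const char* label)
        : label_(label), start_(std::chrono::steady_clock::now()) {
        assert(label != nullptr);
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

    ~ScopedTimer() {
        std::printf("%s: %.3f ms\n", label_, ElapsedMs());
    }

    // Milliseconds since construction.
    double ElapsedMs() const {
        const auto now = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(now - start_).count();
    }

private:
    const char* label_;
    std::chrono::steady_clock::time_point start_;
};
```

Usage would be something like `{ ScopedTimer t("SetShaderParameters"); /* code under test */ }`. Note this only measures CPU time spent in the call; GPU work is queued and runs later.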

Or, if you do want a full profiling system, go download Remotery. Within one coding session you'll get a fancy GUI where you can see a task hierarchy. There are quite a few other open source C++ profiling libraries that you could use instead, too.

3 hours ago, mister345 said:

Oh I see. So I installed DirectXTK using the NuGet Package Manager, but it's giving me this weird "mismatch" error, like it's referring to the wrong version or something. Any ideas about this?

LNK2038 mismatch detected for '_ITERATOR_DEBUG_LEVEL': value '2' doesn't match value '0' in graphicsclass.obj Engine C:\Users\n\Desktop\RastertekResources\dx11tut29\dx11tut29\Engine\Engine\DirectXTK.lib(DDSTextureLoader.obj) 1

I would just build it from source code, or add their source files directly to your game project.

C++ is a pain in that compiler versions and settings usually need to be an exact match for pre-built libraries to work... Apparently their code was built with a standard library debug validation layer enabled, and your code has it disabled.

. 22 Racing Series .

turanszkij

545

December 01, 2017 02:04 PM

9 hours ago, Hodgman said:

There's one thing to watch out for, mentioned in the docs:

[...]

To avoid this, the safest thing to do is to write into your own temporary structure on the stack, and then memcpy it into the mapped data.

The doc also mentions this:

Use the appropriate optimization settings and language constructs to help avoid this performance penalty. For example, you can avoid the xor optimization by using a volatile pointer or by optimizing for code speed instead of code size.

Does this mean that casting the mapped resource pointer to a volatile struct is safe in this case? I really want to make use of this for font rendering and instancing, for example, because those sub-allocate from a ring buffer and write to the returned address by casting.

Wicked Engine

SoldierOfLight

2,378

December 01, 2017 03:07 PM

For low-hanging fruit... you're building your project in 'release' mode, not 'debug' mode, right? That can yield surprising performance gains.