# [d3d12] expected CPU impact of ExecuteIndirect

Hi,

To better understand the behavior of ExecuteIndirect, I took the MS D3D12ExecuteIndirect sample and simply changed the number of triangles to 320,000. I expected the CPU load to change little, since I thought this offloaded execution of the draw list entirely to the GPU, leaving the CPU with more or less fixed overhead. (Note that the MS sample offloads draw-list culling to a compute shader, so there is no per-frame culling work on the CPU side.) To my surprise, however, I immediately became CPU bound, to much the same extent as if I had stashed all the draw commands in a bundle. To be clear, the single indirect draw command shown below is responsible for all the cycles.

I can only assume the MS sample is not at fault (at least, it looks OK to me). So is it that (a) my understanding of ExecuteIndirect is flawed, (b) the current NVIDIA driver on my hardware (353.62 on a GTX 970M) is emulating it, or (c) <insert your opinion here>?

Thanks for the insights!

Thanks,

Jason

    m_commandList->ExecuteIndirect(
        m_commandSignature.Get(),
        TriangleCount,
        m_processedCommandBuffers[m_frameIndex].Get(),
        0,
        m_processedCommandBuffers[m_frameIndex].Get(),
        CommandBufferSizePerFrame);
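For readers who have not opened the sample, here is a rough sketch of what the argument buffer passed to the call above probably contains. It assumes each per-draw record is a root constant buffer address followed by the draw arguments; the types below are stand-ins for `D3D12_GPU_VIRTUAL_ADDRESS` and `D3D12_DRAW_ARGUMENTS` so the snippet compiles without `d3d12.h`, and `command_buffer_bytes` is a hypothetical helper, not something from the sample.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for D3D12_GPU_VIRTUAL_ADDRESS (a 64-bit GPU address).
using GpuVirtualAddress = std::uint64_t;

// Stand-in for D3D12_DRAW_ARGUMENTS (four 32-bit fields).
struct DrawArguments {
    std::uint32_t VertexCountPerInstance;
    std::uint32_t InstanceCount;
    std::uint32_t StartVertexLocation;
    std::uint32_t StartInstanceLocation;
};

// Assumed per-draw record: which constant buffer to bind, then what to draw.
struct IndirectCommand {
    GpuVirtualAddress cbv;
    DrawArguments drawArguments;
};

// 8 bytes of address plus 16 bytes of draw arguments on common ABIs.
static_assert(sizeof(IndirectCommand) == 24, "unexpected padding in IndirectCommand");

// Hypothetical helper: bytes of commands needed for one frame.
constexpr std::size_t command_buffer_bytes(std::size_t triangleCount) {
    return triangleCount * sizeof(IndirectCommand);
}
```

Under these assumptions, 320,000 triangles need 7,680,000 bytes of commands per frame, and every record rebinds a root constant buffer before drawing. The last argument in the call above is the byte offset of the draw count inside the same buffer.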



##### Share on other sites

Glancing over the sample's code for 2 minutes: maybe it's because you're writing 78.13 MB worth of constant buffer data to a temporary location, then reading it again, then sending it every frame from the CPU to the GPU over the PCIe bus? (The theoretical maximum of PCIe 3.0 x16 is 15.75 GB/s, which means you can't go over 200 FPS, assuming that's the only thing you do and ignoring PCIe overhead.)
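The figures in that estimate can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes each triangle owns one 256-byte constant buffer (which is what 78.13 MB divided by 320,000 works out to), and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: one constant buffer per triangle, padded to 256 bytes.
constexpr std::uint64_t kBytesPerConstantBuffer = 256;

// Total constant buffer data for the given number of triangles.
constexpr std::uint64_t constant_buffer_bytes(std::uint64_t triangleCount) {
    return triangleCount * kBytesPerConstantBuffer;
}

// Converts bytes to mebibytes (1 MiB = 1024 * 1024 bytes).
constexpr double to_mib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// Upper bound on full uploads per second if the bus did nothing else.
inline double max_uploads_per_second(std::uint64_t bytesPerFrame,
                                     double busBytesPerSecond) {
    assert(bytesPerFrame > 0);
    return busBytesPerSecond / static_cast<double>(bytesPerFrame);
}
```

With 320,000 triangles this gives 81,920,000 bytes (78.125 MiB), and at 15.75e9 bytes per second the bound comes out at about 192 uploads per second, which matches the "can't go over 200 FPS" estimate.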

Meanwhile the compute shader will laugh, skipping all the data that is not needed because it fails the frustum cull.

Use GPU-Z, vendor-specific GPU counters and GPU profilers, and a CPU profiler to find where the bottleneck is.

Edited by Matias Goldberg

##### Share on other sites

No.

I immediately disabled OnUpdate for this reason. (And in any case it makes very little difference on my system).

As mentioned, this is CPU bound, so I used a CPU profiler. The majority of the time is spent in D3D12.dll under the ExecuteIndirect command.
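A quick way to cross-check a profiler result like this is to time the call directly. The helper below is a minimal sketch, not taken from the sample: `elapsed_milliseconds` is a hypothetical name, and it measures wall-clock time spent on the calling thread, so it only shows how long the CPU was held inside the call, not how long the GPU takes to execute the commands.

```cpp
#include <chrono>
#include <utility>

// Runs the callable once and returns the elapsed wall-clock time in milliseconds.
template <typename Fn>
double elapsed_milliseconds(Fn&& recordCommands) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(recordCommands)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

In the sample, the callable would be a lambda wrapping the `m_commandList->ExecuteIndirect(...)` call. Comparing the result at a few different triangle counts would show whether the CPU cost scales with the number of draws or stays roughly fixed.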

##### Share on other sites

You could be evicting a lot of memory. Use GPUView to find out the actual stall patterns.

##### Share on other sites

OK, thanks for the advice. Though I don't see why that would occur here (relatively little memory is in use) or, if it did, why it would burn CPU under the ExecuteIndirect call.

##### Share on other sites

I advise you to try this on an AMD card.

This just came out today, which makes me wonder whether the NV driver is reading your commands and waiting for the GPU part to finish.

I suggest that if it runs well on AMD but not on NV, you do your best to gather all possible evidence that it is truly CPU bound and present it to NVIDIA (GPUView captures, CPU profiler screenshots, GPU profiler screenshots, GPU-Z screenshots, what happens when you alter the code in different ways, what happens if you make the shaders more expensive, etc.).

##### Share on other sites

Interesting, and you're absolutely right: this needs to be tested on AMD hardware as well. (I have 3 machines, but unfortunately all are just different generations of NV; I need to see about an AMD card for the desktop.)