[D3D12] Multiple command queues

10 comments, last by Alessio1989 7 years, 9 months ago

Hello!

D3D12 offers three types of command queues: graphics, compute, and copy.

I am wondering how many of them you can actually use simultaneously to parallelize work in practice.

1. Async compute (graphics + compute) queues. I remember reading on this forum that NVIDIA and Intel do not really support this at the hardware level.

You are, indeed, able to create graphics and compute queues and issue commands to both, but they will be processed on the same hardware queue. AMD, if I am not mistaken, does support this natively.

2. Does it make sense to have two graphics queues? For instance, to render shadow maps for spot and point lights on different queues concurrently.


1) Yeah, unfortunately a lot of hardware does not support concurrent execution of graphics and compute jobs. All AMD graphics cards support it, and the latest NVIDIA architecture (Pascal) should support it too (though I am not sure with the same efficiency as AMD GPUs). Older NVIDIA GPUs look like they have huge performance penalties with "async compute", especially Maxwell GPUs (I am not sure about Kepler GPUs, since they should serialize everything in software). As for Intel iGPUs, they do not support it either, but last time I tried (on a ULP Haswell iGPU) the performance penalty due to software (driver) serialization wasn't as big as on NVIDIA GPUs. Actually, if you want to support async compute, you have to write two different rendering paths based on device and vendor IDs, since D3D12 does not expose any cap bits telling you whether the hardware supports concurrent execution of graphics and compute queues.
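
For illustration, such a vendor check might look like this (a minimal sketch, assuming you already enumerated an IDXGIAdapter1; the per-vendor return values are just illustrative defaults, not official vendor guidance):

```cpp
#include <dxgi.h>

// Pick a rendering path from the adapter's PCI vendor ID.
bool ShouldUseAsyncCompute(IDXGIAdapter1* adapter)
{
    DXGI_ADAPTER_DESC1 desc = {};
    adapter->GetDesc1(&desc);
    switch (desc.VendorId)
    {
    case 0x1002: return true;   // AMD: concurrent graphics + compute in HW
    case 0x10DE: return false;  // NVIDIA: assume serialization (pre-Pascal)
    case 0x8086: return false;  // Intel: the driver serializes the queues
    default:     return false;  // unknown vendor: play it safe
    }
}
```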

2) AFAIK, there is only one graphics queue per adapter node. Instead, you should use multiple command lists (recorded on different threads) and barriers, all submitted to that single graphics queue.
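
A minimal sketch of that pattern, assuming an existing queue and per-thread command allocators (RecordShadowPass and RecordScenePass are hypothetical helpers):

```cpp
#include <d3d12.h>
#include <thread>

void RenderFrame(ID3D12CommandQueue* graphicsQueue,
                 ID3D12GraphicsCommandList* shadowList,
                 ID3D12GraphicsCommandList* sceneList)
{
    // Each list has its own command allocator, so recording can
    // proceed on two threads in parallel.
    std::thread shadowThread([&] { RecordShadowPass(shadowList); }); // hypothetical
    std::thread sceneThread([&] { RecordScenePass(sceneList); });    // hypothetical
    shadowThread.join();
    sceneThread.join();

    // One submission to the single hardware graphics queue.
    ID3D12CommandList* lists[] = { shadowList, sceneList };
    graphicsQueue->ExecuteCommandLists(_countof(lists), lists);
}
```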

"Recursion is the first step towards madness." - "Skegg?ld, Skálm?ld, Skildir ro Klofnir!"
Direct3D 12 quick reference: https://github.com/alessiot89/D3D12QuickRef/

AFAIK there are four kinds of "support" for the "compute queue plus graphics queue" use-case.

  1. The GPU has one HW queue. The OS shares it between multiple SW queues. Switching SW queues requires a full GPU pipeline flush / stall.
  2. As above, but there's no need to flush the GPU when swapping SW queues.
  3. The GPU actually has many HW queues and the switching is done by the GPU itself.
  4. As above, but commands from the queues can actually be executed in parallel within different "shader cores".

NV has been in category #1 in the past; I'm not sure if they've reached #3/#4 yet or are just at #2.

AMD has been at #4 for a while now. Their hardware queues are themselves made up of multiple pipes, e.g. 8 queues with 8 internal pipes gives 64 HW command lists in flight at once. Each HW queue can be given a different priority, and then the GPU can round-robin service the pipes within that queue.

As well as taking turns to read commands out of the different pipes and scheduling them for execution, the execution of these commands can actually overlap -- e.g. for a post-process compute job to execute simultaneously with a shadowmap rendering job. Or for a high priority compute task to start executing half-way through a long draw call. This is just for compute queues though - I'd usually assume that a single graphics queue is a safe bet.

Some GPUs might actually support two (or more) graphics queues (now or in the future)...

Async compute + graphics is very useful because it allows you to make use of HW resources that would otherwise be sitting idle. Shadowmap rendering is bottlenecked by the fixed-function rasterization hardware and the DB cache / ROP / OM stage, which means all the "shader cores" are sitting around idle. Procedural texturing will keep the "shader cores" ALU units busy, but leaves the memory controllers idle. Voxel ray-marching is pretty light on ALU cost, but completely stresses the memory controllers. Finding different tasks like this that have opposite bottlenecks, and then running them in parallel lets you take advantage of these idle HW resources.
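
For illustration, the submission side of the shadow-map example above might look roughly like this (a sketch, assuming the device, fence, and recorded command lists already exist):

```cpp
// Create a dedicated compute queue next to the existing graphics queue.
D3D12_COMMAND_QUEUE_DESC computeDesc = {};
computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
ID3D12CommandQueue* computeQueue = nullptr;
device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));

// Kick the compute work (e.g. post-processing) while shadow maps rasterize.
ID3D12CommandList* computeLists[] = { computeList };
computeQueue->ExecuteCommandLists(1, computeLists);
computeQueue->Signal(fence, ++fenceValue);

// The graphics queue consumes the compute result only after the fence.
graphicsQueue->Wait(fence, fenceValue);
ID3D12CommandList* postLists[] = { postProcessList };
graphicsQueue->ExecuteCommandLists(1, postLists);
```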

This is actually a pretty hard thing to pull off though.

Most examples that I've seen where people have found a decent win are "do some compute work while rendering your shadow maps"... and having two graphics contexts doesn't fit into that use case -- they'd both be competing for the fixed-function rasterization hardware.

In the future it may become useful though -- e.g. running ALU heavy pixel shaders in parallel with memory-bound pixel shaders... Or running a multi-frame, ultra-low-priority light-baking shader in parallel with your actual realtime graphics, for lightmaps that are updated every few seconds as the time of day progresses. I've shipped a game that implemented that last feature, but it was a PITA, as we had to implement it very carefully so the lightbaker only did exactly 1ms of work every frame -- basically cooperative multitasking. Having a second graphics context of a different priority level might've made it much easier to implement.
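
For reference, D3D12 does let you pick a priority when creating a queue. It only exposes NORMAL and HIGH, so a "low-priority" baker queue would have to be approximated by keeping the baker at NORMAL and the realtime queue at HIGH -- and since there's no second graphics queue, a compute queue stands in for the baker here (a sketch, assuming an existing device; mainQueue/bakeQueue are hypothetical):

```cpp
// Realtime graphics queue at high priority...
D3D12_COMMAND_QUEUE_DESC mainDesc = {};
mainDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
mainDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;
ID3D12CommandQueue* mainQueue = nullptr;
device->CreateCommandQueue(&mainDesc, IID_PPV_ARGS(&mainQueue));

// ...and a background light-baking compute queue at normal priority.
D3D12_COMMAND_QUEUE_DESC bakeDesc = {};
bakeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
bakeDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;
ID3D12CommandQueue* bakeQueue = nullptr;
device->CreateCommandQueue(&bakeDesc, IID_PPV_ARGS(&bakeQueue));
```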

Or another use-case is simply when multiple apps are running at once. Windows uses D3D to composite its own desktop/windows now, so a windowed game is sharing the GPU with the OS and other apps too, which have their own graphics contexts!

As for copy queues, I'd assume that every vendor can benefit from you having one of these. Some GPUs may support more than one...


This, but remember that copy queues should have lower bandwidth than the graphics queue (at least on current hardware). They are great for concurrency and background work, but for short jobs that need to be done immediately it is better to use the graphics queue. I am not sure how they compare against compute queues, but I cannot imagine a scenario where it is better to use a compute queue instead of the graphics queue for immediate copy operations only.

"Recursion is the first step towards madness." - "Skegg?ld, Skálm?ld, Skildir ro Klofnir!"
Direct3D 12 quick reference: https://github.com/alessiot89/D3D12QuickRef/

This, but remember that copy queues should have lower bandwidth than the graphics queue (at least on current hardware). They are great for concurrency and background work, but for short jobs that need to be done immediately it is better to use the graphics queue. I am not sure how they compare against compute queues, but I cannot imagine a scenario where it is better to use a compute queue instead of the graphics queue for immediate copy operations only.

Do you have a reference for that? Maybe for CPU-side to CPU-side, or GPU-side to GPU-side transfers that's true... but I wouldn't think it would be for transfers between the CPU side and a dedicated GPU (across PCI-e).

The whole point of the copy queue is that it's designed to fully saturate the PCI-e bus while consuming zero shading/graphics/compute resources (it's just a DMA controller being fed an "async memcpy" job). Intel say that their DMA controller has fairly low throughput, but their "GPU-side RAM" is actually also "CPU-side RAM", so in some cases you'd just be able to use a regular background thread and have it perform the memcpy :lol:

AFAIK -
- if you're copying CPU->CPU, don't use the GPU, call memcpy :lol:
- if you're copying CPU->GPU or GPU->CPU, use the copy queue, except maybe if you're optimizing for Intel or a mobile platform.
- If you're copying GPU->GPU, probably use a compute queue, except maybe for SLI/crossfire (multi-adapter) cases.
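
A minimal sketch of that middle case (assuming the device, fence, destination/staging buffers, and an open copy command list already exist):

```cpp
// Create a dedicated copy queue: it maps to the DMA engine, so it moves
// data across PCI-e without occupying any shader hardware.
D3D12_COMMAND_QUEUE_DESC copyDesc = {};
copyDesc.Type = D3D12_COMMAND_LIST_TYPE_COPY;
ID3D12CommandQueue* copyQueue = nullptr;
device->CreateCommandQueue(&copyDesc, IID_PPV_ARGS(&copyQueue));

// A COPY command list can only record copy operations, e.g.:
copyList->CopyBufferRegion(gpuBuffer, 0,    // dest (DEFAULT heap)
                           uploadBuffer, 0, // src  (UPLOAD heap)
                           bufferSize);
copyList->Close();

ID3D12CommandList* lists[] = { copyList };
copyQueue->ExecuteCommandLists(1, lists);
// Signal a fence instead of spinning; the copy proceeds in the background
// while the graphics and compute queues keep working.
copyQueue->Signal(copyFence, ++copyFenceValue);
```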

The last public reference is the one from Intel, saying what you say, plus a short reference about multi-GPU scenarios (where different scenarios have to be considered on both linked and unlinked adapters). Anyway, it has been said on various occasions that the best answer is always the same: profile!

EDIT: you are right about PCI-E transfer optimization, at least as AMD suggests for its own GPUs.

EDIT2: at least some NVIDIA GPUs have performance limitations here:

NVIDIA: Take care when copying depth+stencil resources – copying only depth may hit slow path

https://developer.nvidia.com/sites/default/files/akamai/gameworks/blog/GDC16/GDC16_gthomas_adunn_Practical_DX12.pdf

Anyway, I am probably confusing myself with older, non-public documentation...

"Recursion is the first step towards madness." - "Skegg?ld, Skálm?ld, Skildir ro Klofnir!"
Direct3D 12 quick reference: https://github.com/alessiot89/D3D12QuickRef/

Anyway, I am probably confusing myself with older, non-public documentation...

No you're not, I distinctly remember something public where the throughput of the copy queue was less than that of using the graphics hardware to perform the transfer. There was no explanation as to why. I can't remember where it's from though, although it might be older as you say.

-potential energy is easily made kinetic-

Thank you guys for the explanations and examples! I definitely have a better understanding now.

This, but remember that copy queues should have lower bandwidth than the graphics queue (at least on current hardware). They are great for concurrency and background work, but for short jobs that need to be done immediately it is better to use the graphics queue. I am not sure how they compare against compute queues, but I cannot imagine a scenario where it is better to use a compute queue instead of the graphics queue for immediate copy operations only.

Do you have a reference for that? Maybe for CPU-side to CPU-side, or GPU-side to GPU-side transfers that's true... but I wouldn't think it would be for transfers between the CPU side and a dedicated GPU (across PCI-e).

The whole point of the copy queue is that it's designed to fully saturate the PCI-e bus while consuming zero shading/graphics/compute resources (it's just a DMA controller being fed an "async memcpy" job). Intel say that their DMA controller has fairly low throughput, but their "GPU-side RAM" is actually also "CPU-side RAM", so in some cases you'd just be able to use a regular background thread and have it perform the memcpy :lol:

For references:

  • DX12PerfTweet 25: Copy queue consumes no shader resources but has less bandwidth than graphics and compute queues.
  • DX12PerfTweet 34: Use the copy queue for background tasks. Spinning for copy to finish is likely inefficient.
  • DX12PerfTweet 56: Use the COPY queue to move memory over PCI-Express: this is more efficient than using COMPUTE or DIRECT queue.
  • GPUOpen blog - Performance Tweets Series: Streaming & Memory Management: (...) The copy queue exposes the copy engine, which is a dedicated DMA engine designed around efficient transfers across the PCIe bus. (...) Before you run off and move all copies to the copy queue, keep in mind the copy queue is not designed for all copies. In fact, the copy engine is only optimized for transferring data over PCIe. It’s the only way to saturate PCIe bandwidth (...).

AFAIK -
- if you're copying CPU->CPU, don't use the GPU, call memcpy :lol:
- if you're copying CPU->GPU or GPU->CPU, use the copy queue, except maybe if you're optimizing for Intel or a mobile platform.
- If you're copying GPU->GPU, probably use a compute queue, except maybe for SLI/crossfire (multi-adapter) cases.

That is pretty much it. Integrated GPUs will perform better if you write directly to the GPU memory from the CPU. It's a mystery to me whether this applies to AMD APUs as well.

For future reference too, I couldn't find it at the time, but I guess the function for CPU-driven CPU->GPU copies on Intel/UMA GPUs is ID3D12Resource::WriteToSubresource.
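
For completeness, a minimal sketch of that call, assuming 'texture' was created on a CPU-accessible (CUSTOM/UMA) heap and 'pixels', 'rowPitch', and 'slicePitch' describe the source data:

```cpp
// On UMA adapters this writes straight into GPU-visible memory,
// with no staging copy or copy-queue submission needed.
HRESULT hr = texture->WriteToSubresource(
    0,          // DstSubresource: mip 0, array slice 0
    nullptr,    // pDstBox: nullptr writes the whole subresource
    pixels,     // pSrcData: CPU-side source memory
    rowPitch,   // SrcRowPitch: bytes per row of source data
    slicePitch  // SrcDepthPitch: bytes per 2D slice (3D textures)
);
```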

This topic is closed to new replies.
