Vulkan queues and how to handle them


Been playing around with Vulkan a bit lately and got the required 'hello triangle' program working. In all the demos/examples I've seen so far they only use 1 or 2 VkQueues, usually one for graphics/compute and one for transfers. Now it's my understanding that to get the most out of modern GPUs (especially going forward), we want to use multiple queues as much as possible when processing commands that are asynchronous.

On my GTX 970 I have 16 queues with full support and 1 transfer-only queue. I read somewhere that AMD has only 1 queue with full support, 1-4 compute-only queues, and 2 transfer queues, and that Intel GPUs only have 1 queue for everything. So there's definitely a large spread of capabilities, which leaves me a bit unsure how to properly utilize these queues.
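For reference, this is roughly how I'm dumping the queue families those numbers come from (just a sketch; error handling omitted):

```cpp
#include <vulkan/vulkan.h>
#include <vector>
#include <cstdio>

// Sketch: list every queue family a physical device exposes and what it supports.
void printQueueFamilies(VkPhysicalDevice gpu)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, families.data());

    for (uint32_t i = 0; i < count; ++i)
    {
        const VkQueueFamilyProperties& f = families[i];
        std::printf("family %u: %u queue(s)%s%s%s\n", i, f.queueCount,
                    (f.queueFlags & VK_QUEUE_GRAPHICS_BIT) ? " graphics" : "",
                    (f.queueFlags & VK_QUEUE_COMPUTE_BIT)  ? " compute"  : "",
                    (f.queueFlags & VK_QUEUE_TRANSFER_BIT) ? " transfer" : "");
    }
}
```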

The easiest solution is just to use one queue and be done with it, but this seems a real waste of Vulkan's capabilities. I guess it's theoretically possible that a GPU/driver could reorder command submissions to obtain some level of parallelization, but I don't know if any actually do. This being Vulkan, my feeling is that that sort of thing would be left to us programmers, and that most Vulkan drivers would tend towards the minimalistic side.

The second solution would be to try to create a unique VkQueue for each rendering thread. Since issuing commands on a queue requires host synchronization, this maps nicely to a multi-threaded rendering solution. Each thread gets its own queue and its own command pool. Each thread can render, transfer, or whatever, all independently, and the final presentation is then done by the master thread when everything is ready. Seems like a great solution to me, but only the NVidia GPUs have enough extra queues to go around.
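Roughly what I have in mind per thread (a sketch only; it assumes enough queues were requested from the family at device creation, which as noted only really holds on NV):

```cpp
#include <vulkan/vulkan.h>

// Sketch: one queue and one command pool per render thread. Assumes the
// device was created requesting at least 'threadIndex + 1' queues from
// this family - vkGetDeviceQueue only retrieves queues requested up front.
struct ThreadContext
{
    VkQueue       queue = VK_NULL_HANDLE;
    VkCommandPool pool  = VK_NULL_HANDLE;
};

ThreadContext createThreadContext(VkDevice device, uint32_t queueFamily, uint32_t threadIndex)
{
    ThreadContext ctx;
    vkGetDeviceQueue(device, queueFamily, threadIndex, &ctx.queue);

    VkCommandPoolCreateInfo info = {};
    info.sType            = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
    info.flags            = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT;
    info.queueFamilyIndex = queueFamily;
    vkCreateCommandPool(device, &info, nullptr, &ctx.pool);
    return ctx;
}
```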

A third option would be to write separate rendering pipelines for each card type. Write one solution for NVidia, one for AMD, etc... but this isn't something I'd like to do.

A fourth idea was to create a sort of VkQueue pool, where each rendering thread could pull a VkQueue when it needs one, returning it when done. This would work, but it could lead to contention in the case of only 1 queue, undermining any multi-threaded benefits.
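Something like this is what I'm picturing (sketch only, names made up; with a single queue every other thread just ends up blocking in acquire):

```cpp
#include <vulkan/vulkan.h>
#include <condition_variable>
#include <mutex>
#include <vector>

// Sketch: threads borrow a VkQueue from the pool and return it when done.
class QueuePool
{
public:
    explicit QueuePool(std::vector<VkQueue> queues) : m_queues(std::move(queues)) {}

    VkQueue acquire()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_available.wait(lock, [this] { return !m_queues.empty(); });
        VkQueue q = m_queues.back();
        m_queues.pop_back();
        return q;
    }

    void release(VkQueue q)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queues.push_back(q);
        }
        m_available.notify_one();
    }

private:
    std::mutex              m_mutex;
    std::condition_variable m_available;
    std::vector<VkQueue>    m_queues;
};
```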

There's also a half dozen other minor variations of the 4 options above... At this point I'm not really sure how the Vulkan committee envisioned us using the queues. How are you guys handling this situation?


I'm using just a single queue. I think it's fine - if you try and split up a frame's worth of rendering across multiple queues, you're going to have a hell of a time synchronising everything between those queues.

I think the rule of thumb should be that you'd only use multiple queues in situations where you want the work to be performed at different rates and/or with different priorities. For example, you might be using the GPU to decompress textures on the fly (like Rage, say). The timing of that work has a degree of separation from the main business of rendering a frame, so it might make sense for it to be done on a different queue. I don't think there'd be big gains in spreading a single frame's worth of work across multiple queues.
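For what it's worth, that kind of rate/priority split is expressed when creating the device - a minimal sketch (the family index and priority values are just placeholders, and priorities are only a hint to the driver's scheduler):

```cpp
#include <vulkan/vulkan.h>

// Sketch: request two queues from one family with different priorities.
float priorities[2] = { 1.0f, 0.5f };   // frame rendering, background work

VkDeviceQueueCreateInfo queueInfo = {};
queueInfo.sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queueInfo.queueFamilyIndex = 0;         // placeholder - pick the real family
queueInfo.queueCount       = 2;
queueInfo.pQueuePriorities = priorities;

VkDeviceCreateInfo deviceInfo = {};
deviceInfo.sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
deviceInfo.queueCreateInfoCount = 1;
deviceInfo.pQueueCreateInfos    = &queueInfo;
// features, extensions and the vkCreateDevice call omitted
```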

"I guess its theoretically possible that a GPU/driver could reorder command submissions to obtain some level of parallelization, but I don't know if any actually do." - Even using a single queue, you're still going to get an enormous amount of parallelization on pretty much every task you perform. The advantage of multiple queues is that it gives the scheduler an option to be clever in the event that there are bubbles that are preventing you from keeping the cores busy. I think multiple queues are slightly analogous to hyperthreading behaviour on CPUs, they give the hardware an opportunity to go off and do something else when one task is stalled, to stretch the analogy, imagine your GPU is like a CPU with 128 cores, it's not like using a single queue will utilize only a single core, it's more like you're using all 128 cores, but have disabled the hyperthreading on them.

Edit: This might be useful: http://stackoverflow.com/questions/37575012/should-i-try-to-use-as-many-queues-as-possible

On my GTX 970 I have 16 queues with full support and 1 transfer-only queue. I read somewhere that AMD has only 1 queue with full support, 1-4 compute-only queues, and 2 transfer queues, and that Intel GPUs only have 1 queue for everything.

AMD has 1 graphics/compute queue, 3 compute-only queues, and 2 transfer queues - you can look up this stuff for various GPUs here: http://vulkan.gpuinfo.org/displayreport.php?id=700#queuefamilies

Early drivers supported 8 compute queues, so this may change.

The big number of 16 does not mean there is native hardware support for, e.g., processing 16 different compute tasks simultaneously.

AFAIK AMD's async compute is still better (GCN can execute work from different dispatches on one CU, or at least on different CUs; Pascal can only do preemption? Not sure about that).

I did some experiments with async compute using multiple queues on AMD, but it was just a small loss. Probably because my shaders were totally saturating the GPU, so there is no point in running them async.

But I noticed that their execution start and end timestamps overlap, so at least it works.

I will repeat this with less demanding shaders later - I guess it becomes a win then.

What I hear everywhere is that we should do compute work while rendering shadow maps or early Z - at least that's one example where we need to use multiple queues (besides the obvious data transfers).

Personally I don't think using one queue per render thread makes a lot of sense. After multiple command buffers have been created, we can just use one thread to submit them.
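I.e. the submitting thread can batch everything into a single call - rough sketch, names are illustrative:

```cpp
#include <vulkan/vulkan.h>
#include <vector>

// Sketch: worker threads record into 'recorded' (already in submission order),
// then one thread submits the whole batch with a single vkQueueSubmit.
void submitFrame(VkQueue graphicsQueue,
                 const std::vector<VkCommandBuffer>& recorded,
                 VkFence frameFence)
{
    VkSubmitInfo submit = {};
    submit.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.commandBufferCount = static_cast<uint32_t>(recorded.size());
    submit.pCommandBuffers    = recorded.data();

    vkQueueSubmit(graphicsQueue, 1, &submit, frameFence);
}
```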

A third option would be to write separate rendering pipelines for each card type. Write one solution for NVidia, one for AMD, etc... but this isn't something I'd like to do.

I think you should. I'm still on AMD only at the moment, but in the past I've had differences just in compute shaders between NV/AMD resulting in +/-50% performance. I expect the same applies to almost anything else, especially with a more low-level API :(

Thanks for the input, guys. Another question about VkCommandPool and thread pools. Having 1 pool per thread seems pretty straightforward if all the threads are synchronized every frame (create command buffers, synchronize, submit, release buffers, repeat). But if we want to hang on to some of the command buffers for reuse, things become a little more problematic.

A few specific questions:
- Do you just not reuse command buffers (all command buffers are submit-once-and-release)?
- Do you reuse command buffers only sparingly and for very specific situations (a single thread pre-builds them or something similar)?
- Do you skip thread pools and instead use a dedicated/hardcoded threading model (i.e. thread 1 does X, thread 2 does Y, etc.)?
- Do you send reused command buffers back to their creating thread for release, or wrap the command pool in a mutex or some other synchronization?
- Are command pools lightweight enough to have 1 or 2 command buffers per pool, and just pass them as a group between the threads?
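To make the questions concrete, the simplest per-frame pattern I can think of looks roughly like this (sketch; one pool per thread, fence handling simplified) - it's the reuse case that doesn't fit it cleanly:

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Sketch: the "rebuild every frame" model. Each thread waits until the GPU
// has finished with last frame's buffers, resets its own pool (which recycles
// every buffer allocated from it), then records fresh command buffers.
// Reused buffers would instead be recorded once and kept out of this reset.
void beginThreadFrame(VkDevice device, VkCommandPool threadPool, VkFence lastFrameFence)
{
    vkWaitForFences(device, 1, &lastFrameFence, VK_TRUE, UINT64_MAX);
    vkResetCommandPool(device, threadPool, 0);
}
```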

I generate per-frame command buffers only for debug output.

Anything serious is: create command buffers once with indirect dispatch commands, then at runtime fill the dispatch buffer from a compute shader (e.g. one doing frustum culling).

This makes the whole idea of multithreaded rendering obsolete - per frame the CPU only needs to submit the same command buffers, so there is no point in using multiple threads.

I think this approach can scale up well to a complex graphics engine with very few exceptions.

E.g. if you want to keep occlusion culling on the CPU, per frame this only results in a buffer upload to identify the visible stuff; no new command buffer is necessary.
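To illustrate the idea (rough sketch; pipelines, descriptor sets and the args buffer are assumed to be set up elsewhere) - the command buffer is recorded once, and only the contents of the indirect args buffer change per frame:

```cpp
#include <vulkan/vulkan.h>

// Sketch: recorded once, reused every frame. A culling dispatch writes a
// VkDispatchIndirectCommand into argsBuffer; the second dispatch then reads
// its group counts from that buffer, so no re-recording is needed.
void recordStaticCommands(VkCommandBuffer cmd, VkPipeline cullPipeline,
                          VkPipeline mainPipeline, VkBuffer argsBuffer,
                          uint32_t cullGroupsX)
{
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, cullPipeline);
    vkCmdDispatch(cmd, cullGroupsX, 1, 1);               // fills argsBuffer

    // make the compute write visible to the indirect command read
    VkMemoryBarrier barrier = {};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
    vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT, 0,
                         1, &barrier, 0, nullptr, 0, nullptr);

    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, mainPipeline);
    vkCmdDispatchIndirect(cmd, argsBuffer, 0);
}
```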

Welcome to the world of hardware differences and the lies they tell :)

So, NV... ugh... NV are basically a massive black box, because unless you are an ISV of standing (I guess?) they pretty much don't tell you anything about how the important bits of their hardware work, which is a pain and leads to people trying to figure out wtf is going on when their code starts to run slowly.

The first thing we learn is that not all queues are created equal; even if you can create 16 queues which can consume all command types, this doesn't indicate how well the work will execute. The point of contention doesn't in fact seem to be the CUDA cores or CUs (NV and AMD respectively) but the front-end dispatcher logic, at least in NV's case.

A bit of simplified GPU theory: your commands, when submitted to the GPU, are dispatched in the order the front-end command processor sees them. This command processor can only keep so many work packets in flight before it has to wait for resources. So, for example, it might be able to keep 10 'draw' packets in flight, but if you submit an 11th, even if you have a CUDA core/CU free which could do the work, the command processor can't dispatch the work to it until it has free resources to track said work. Also, IIRC, work is retired in order: so if 'draw' packet 3 finishes before packet 1, the command processor can still be blocked from dispatching more work out to the ALU segment of the GPU.

On anything pre-Pascal I would just avoid trying to interleave gfx and compute queue work at all; just go with a single queue, as the hardware seems to have problems keeping work sanely in flight when mixing. By all accounts Pascal seems to do better, but I've not seen many net wins in benchmarks from it (at least, not by a significant amount), so even with Pascal you might want to default to a single queue.
(Pre-emption doesn't really help with this either; that deals with the ability to swap state out so that other work can take over, and it is a heavy operation - CPU-wise it is closer to switching processes in terms of overhead. Pascal's tweak here is the ability to suspend work at instruction boundaries.)

AMD are a lot more open with how their stuff works, which is good for us :)
Basically all AMD GCN-based cards in the wild today will have hardware for 1 'graphics queue' (which can consume gfx, compute and copy commands) and 2 DMA queues which can consume copy commands only.
The compute queues are likely both driver and hardware dependent, however. When Vulkan first appeared my R290 reported back only 2 compute queues; it now reports 7. However, I'm currently not sure how that maps to hardware; while the hardware has 8 'async compute units' (ACEs), this doesn't mean it is one queue per ACE, as each ACE can service 8 'queues' of instructions itself. (So, in theory, Vulkan on my AMD hardware could report 64 compute-only queues, 8*8.) If it is one queue per ACE, then life gets even more fun, because each ACE can maintain two pieces of work at once, meaning you could launch, resources allowing, 14 'compute' jobs + N gfx jobs and not have anything block.

When it comes to using this stuff, certainly in an async manner, it is important to consider what is going on at the same time.

If your gfx task is bandwidth-heavy but ALU-light, then you might have some ALU spare for compute work to take advantage of - but if you pair ALU-heavy with ALU-heavy, or bandwidth-heavy with bandwidth-heavy, you might see a performance dip.

Ultimately the best thing you can do is probably to make your graphics/compute setup data-driven in some way, so you can reconfigure things based on hardware type and factors like screen resolution, etc.

I certainly wouldn't, however, try to drive the hardware from multiple threads into the same window/view - that feels like a terrible idea and a problem waiting to happen.
- Jobs to build command lists
- Master job(s) to submit built commands in the correct order to the correct queue
That would be my take on the setup.

A third option would be to write separate rendering pipelines for each card type. Write one solution for NVidia, one for AMD, etc... but this isn't something I'd like to do.


I think you should. I'm still on AMD only at the moment, but in the past I've had differences just in compute shaders between NV/AMD resulting in +/-50% performance. I expect the same applies to almost anything else, especially with a more low-level API :(


To a degree you'll have to do this if you want maximal performance; if you don't want max performance I would question your Vulkan usage, more so if you don't want to deal with the hardware awareness which comes with it :)

However, it doesn't have to be too bad if you can design your rendering/compute system in such a way that it is flexible enough to cover the differences - at the simplest level, you have a graph which describes your gfx/compute work, and on AMD you dispatch to 2 queues while on NV you serialise into a single queue in the correct order. (Also, don't forget Intel in all this.)
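At the API level the split can stay fairly small - a rough sketch of what I mean (the family selection is assumed to happen elsewhere, and the queues must have been requested at device creation):

```cpp
#include <vulkan/vulkan.h>

// Sketch: grab an async compute queue when a compute-only family exists
// (typical on AMD); otherwise alias it to the gfx queue so the same
// higher-level code just ends up serialised on one queue (NV/Intel path).
struct QueueSetup
{
    VkQueue graphics     = VK_NULL_HANDLE;
    VkQueue compute      = VK_NULL_HANDLE;   // may alias graphics
    bool    asyncCompute = false;
};

QueueSetup selectQueues(VkDevice device, uint32_t gfxFamily, int computeOnlyFamily /* -1 if none */)
{
    QueueSetup s;
    vkGetDeviceQueue(device, gfxFamily, 0, &s.graphics);
    if (computeOnlyFamily >= 0)
    {
        vkGetDeviceQueue(device, static_cast<uint32_t>(computeOnlyFamily), 0, &s.compute);
        s.asyncCompute = true;
    }
    else
    {
        s.compute = s.graphics;
    }
    return s;
}
```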

Your shaders and the parameters to them will require tweaking too; AMD prefer small footprints in the 'root' data because they have limited register space to preload, while NV, on the other hand, are fine with lots of stuff in the 'root' signature of the shaders. You'll also likely want to take advantage of vendor shader extensions for maximal performance on the respective hardware.

Now it's my understanding that to get the most out of modern GPUs (especially going forward), we want to use multiple queues as much as possible when processing commands that are asynchronous.

I might be a bit out of date with the latest cards, but I thought best practice was to use 1 general + 1 transfer queue on NVidia/Intel (and on Intel you should think twice about utilising the transfer queue), and 1 general + 1 transfer + 1 compute queue on AMD to take advantage of their async compute feature.
Yep, that's a good default position to take - compute of course depends on your workloads even with AMD, but it's a good target to have, as it means the spare ALU can be used even when the command processor has no more 'work slots' to hand out. (The general feel is that AMD have fewer 'work slots' than NV on their graphics command processor, which is partly why NV have better performance with a single queue, as you leave less ALU on the table by default - kind of an 'AMD can launch 5, NV can launch 10' thing. I've just made those numbers up, but you get the idea ;) )

I should probably ask someone (either at work or just from AMD) how command queues map to hardware with Vulkan for compute; if it all runs on the same ACE, then you hit the 'work slot' limit with two compute workloads in flight at once, but if it is mapped across them, then you could have up to 14 independent things going at once. Would be nice to know the balance.

However, as a rule: 1 general + 1 compute + 1 transfer on AMD, 1 general + 1 transfer on NV, and potentially only the gfx queue on Intel due to shared memory.

And being able to configure your app via data to start up in any of those modes is also a good idea for sanity reasons; i.e. do everything on the gfx queue to make sure things are sane with synchronisation before trying to introduce a compute queue.
(Same reason you'll want a Fence All The Things! mode for debug/sanity reasons)

Some relevant things covered here: http://gpuopen.com/vulkan-and-doom/

I should probably ask someone (either at work or just from AMD) how command queues map to hardware with Vulkan for compute; if it all runs on the same ACE, then you hit the 'work slot' limit with two compute workloads in flight at once, but if it is mapped across them, then you could have up to 14 independent things going at once. Would be nice to know the balance.

Let us know if you hear something... :)

