Ryan_001

Vulkan queues and how to handle them


Been playing around with Vulkan a bit lately and got the obligatory 'hello triangle' program working.  In all the demos/examples I've seen so far they only use 1 or 2 VkQueues, usually one for graphics/compute and one for transfers.  Now, my understanding is that to get the most out of modern GPUs (especially going forward), we want to use multiple queues as much as possible when processing commands that can run asynchronously.

 

On my GTX 970 I have 16 queues with full support, and 1 transfer-only queue.  I read somewhere that AMD has only 1 queue with full support, 1-4 compute-only queues, and 2 transfer queues, and that Intel GPUs only have 1 queue for everything.  So there's definitely a large spread of capabilities, which leaves me a bit unsure how to properly utilize these queues.

 

The easiest solution is just to use one queue and be done with it, but this seems like a real waste of Vulkan's capabilities.  I guess it's theoretically possible that a GPU/driver could reorder command submissions to obtain some level of parallelization, but I don't know if any actually do.  This being Vulkan, my feeling is that that sort of thing would be left to us programmers, and that most Vulkan drivers would tend towards the minimalistic side.

 

The second solution would be to create a unique VkQueue for each rendering thread.  Since issuing commands on a queue requires host synchronization, this maps nicely to a multi-threaded rendering solution.  Each thread gets its own queue and its own command pool.  Each thread can render, transfer, or whatever, all independently, and the final presentation is then done by the master thread when everything is ready.  Seems like a great solution to me, but only the NVidia GPUs have enough extra queues to go around.
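Something like this is what I'm picturing at device-creation time (just a sketch - renderThreadCount is whatever your job system uses, and error handling is omitted):

```cpp
#include <vulkan/vulkan.h>
#include <vector>
#include <algorithm>

// Sketch: create one queue per render thread, clamped to what the
// graphics-capable queue family actually exposes (16 on some NV parts,
// often just 1 elsewhere).
VkDevice CreateDeviceWithPerThreadQueues(VkPhysicalDevice physical,
                                         uint32_t renderThreadCount,
                                         uint32_t& graphicsFamily,
                                         std::vector<VkQueue>& outQueues)
{
    uint32_t familyCount = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physical, &familyCount, nullptr);
    std::vector<VkQueueFamilyProperties> families(familyCount);
    vkGetPhysicalDeviceQueueFamilyProperties(physical, &familyCount, families.data());

    // Pick the first family that supports graphics (and therefore compute/transfer too).
    graphicsFamily = 0;
    for (uint32_t i = 0; i < familyCount; ++i) {
        if (families[i].queueFlags & VK_QUEUE_GRAPHICS_BIT) { graphicsFamily = i; break; }
    }

    // Ask for one queue per thread, but never more than the family provides.
    uint32_t queueCount = std::min(renderThreadCount, families[graphicsFamily].queueCount);
    std::vector<float> priorities(queueCount, 1.0f);

    VkDeviceQueueCreateInfo queueInfo{};
    queueInfo.sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfo.queueFamilyIndex = graphicsFamily;
    queueInfo.queueCount       = queueCount;
    queueInfo.pQueuePriorities = priorities.data();

    VkDeviceCreateInfo deviceInfo{};
    deviceInfo.sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.queueCreateInfoCount = 1;
    deviceInfo.pQueueCreateInfos    = &queueInfo;

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(physical, &deviceInfo, nullptr, &device);

    outQueues.resize(queueCount);
    for (uint32_t i = 0; i < queueCount; ++i)
        vkGetDeviceQueue(device, graphicsFamily, i, &outQueues[i]);
    return device;
}
```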

 

A third option would be to write separate rendering pipelines for each card type: one solution for NVidia, one for AMD, etc.  But this isn't something I'd like to do.

 

A fourth idea was to create a sort of VkQueue pool, where each rendering thread could pull a VkQueue when it needs one and return it when done.  This would work, but could lead to contention when there's only 1 queue, undermining any multi-threaded benefit.
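A minimal sketch of that pool, assuming the queues were retrieved as above; with only one queue in the pool it degenerates into a lock around every submission:

```cpp
#include <vulkan/vulkan.h>
#include <vector>
#include <mutex>
#include <condition_variable>

// Sketch of a VkQueue pool: threads borrow a queue for submission and
// return it when done. With a single queue in the pool this is effectively
// a mutex around every submit - the contention mentioned above.
class QueuePool {
public:
    explicit QueuePool(std::vector<VkQueue> queues) : free_(std::move(queues)) {}

    VkQueue Acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !free_.empty(); });
        VkQueue q = free_.back();
        free_.pop_back();
        return q;
    }

    void Release(VkQueue q) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            free_.push_back(q);
        }
        cv_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::vector<VkQueue> free_;
};
```

A render thread would Acquire(), do its vkQueueSubmit, then Release() the queue again.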

 

There are also a half dozen other minor variations of the 4 options above...  At this point I'm not really sure how the Vulkan committee envisioned us using queues.  How are you guys handling this situation?


I'm using just a single queue. I think it's fine - if you try to split up a frame's worth of rendering across multiple queues you're going to have a hell of a time synchronising everything between those queues.

 

I think the rule of thumb should be that you'd only use multiple queues in situations where you want the work to be performed at different rates and/or with different priorities. For example, you might be using the GPU to decompress textures on the fly (like Rage say). The timing of that work has a degree of separation from the main business of rendering a frame, so it might make sense for it to be done on a different queue. I don't think there'd be big gains to spreading out a single frame's worth of work across multiple queues.
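If you did go that route, the separation can at least be hinted to the driver with queue priorities at device-creation time. A rough sketch (the priority values are arbitrary, and the family is assumed to expose at least 2 queues):

```cpp
#include <vulkan/vulkan.h>

// Sketch: two queues from the same graphics-capable family, the second one
// at a lower priority for background work such as on-the-fly texture
// decompression. Priorities are only hints to the scheduler.
VkDeviceQueueCreateInfo MakeQueueRequest(uint32_t graphicsFamily)
{
    // Static so the pointer stays valid until vkCreateDevice is called.
    static const float priorities[2] = { 1.0f, 0.2f }; // main render, background streaming

    VkDeviceQueueCreateInfo info{};
    info.sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    info.queueFamilyIndex = graphicsFamily;
    info.queueCount       = 2;               // assumes the family exposes at least 2 queues
    info.pQueuePriorities = priorities;
    return info;
}
```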

 

"I guess its theoretically possible that a GPU/driver could reorder command submissions to obtain some level of parallelization, but I don't know if any actually do." - Even using a single queue, you're still going to get an enormous amount of parallelization on pretty much every task you perform. The advantage of multiple queues is that it gives the scheduler an option to be clever in the event that there are bubbles that are preventing you from keeping the cores busy. I think multiple queues are slightly analogous to hyperthreading behaviour on CPUs, they give the hardware an opportunity to go off and do something else when one task is stalled, to stretch the analogy, imagine your GPU is like a CPU with 128 cores, it's not like using a single queue will utilize only a single core, it's more like you're using all 128 cores, but have disabled the hyperthreading on them.

 

Edit: This might be useful: http://stackoverflow.com/questions/37575012/should-i-try-to-use-as-many-queues-as-possible

Edited by C0lumbo


On my GTX 970 I have 16 queues with full support, and 1 transfer-only queue. I read somewhere that AMD had only 1 queue with full support, 1-4 compute-only queues, and 2 transfer queues, and Intel GPUs only have 1 queue for everything.

 

AMD has 1 graphics/compute queue, 3 compute-only queues, and 2 transfer queues - you can look up this stuff for various GPUs here: http://vulkan.gpuinfo.org/displayreport.php?id=700#queuefamilies

Early drivers supported 8 compute queues, so this may change.
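In code, picking those families is just a flag check; a rough sketch that prefers a compute-only family for async compute and a transfer-only family for uploads, falling back to the graphics family where the hardware doesn't expose them:

```cpp
#include <vulkan/vulkan.h>
#include <vector>

struct QueueFamilies {
    uint32_t graphics     = UINT32_MAX;
    uint32_t computeOnly  = UINT32_MAX;  // compute but not graphics (AMD's ACE-backed families)
    uint32_t transferOnly = UINT32_MAX;  // transfer but neither graphics nor compute (DMA engines)
};

// Sketch: classify queue families by their capability flags.
QueueFamilies PickQueueFamilies(VkPhysicalDevice physical)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physical, &count, nullptr);
    std::vector<VkQueueFamilyProperties> props(count);
    vkGetPhysicalDeviceQueueFamilyProperties(physical, &count, props.data());

    QueueFamilies out;
    for (uint32_t i = 0; i < count; ++i) {
        VkQueueFlags f = props[i].queueFlags;
        if ((f & VK_QUEUE_GRAPHICS_BIT) && out.graphics == UINT32_MAX)
            out.graphics = i;
        else if ((f & VK_QUEUE_COMPUTE_BIT) && !(f & VK_QUEUE_GRAPHICS_BIT)
                 && out.computeOnly == UINT32_MAX)
            out.computeOnly = i;
        else if ((f & VK_QUEUE_TRANSFER_BIT)
                 && !(f & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT))
                 && out.transferOnly == UINT32_MAX)
            out.transferOnly = i;
    }
    // Fall back to the graphics family if dedicated families are missing (Intel, older drivers).
    if (out.computeOnly == UINT32_MAX)  out.computeOnly  = out.graphics;
    if (out.transferOnly == UINT32_MAX) out.transferOnly = out.graphics;
    return out;
}
```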

 

A big number like 16 does not mean the hardware can natively process, say, 16 different compute tasks simultaneously.

AFAIK AMD's async compute is still better (GCN can execute work from different dispatches on one CU, or at least on different CUs; Pascal can only do preemption? Not sure about it).

I did some experiments with async compute using multiple queues on AMD, but it was a small loss. Probably because my shaders were totally saturating the GPU, so there was no point in running them async.

But I noticed that their execution start and end timestamps overlap, so at least it works.

I will repeat this with less demanding shaders later - I expect it will become a win then.
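For reference, the overlap check is just a pair of timestamps per queue, roughly like this (a sketch - the query pool is assumed to be created with VK_QUERY_TYPE_TIMESTAMP and at least 2 queries, and the actual dispatches are elided):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Sketch: bracket the work on one queue with timestamps so the start/end
// times from the graphics and compute queues can be compared for overlap.
void RecordTimestampedWork(VkCommandBuffer cmd, VkQueryPool pool)
{
    vkCmdResetQueryPool(cmd, pool, 0, 2);
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, pool, 0);
    // ... vkCmdDispatch / draw calls for this queue go here ...
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, pool, 1);
}

// After the fence for this submission has signalled, read the two ticks back.
void ReadTimestamps(VkDevice device, VkPhysicalDevice physical, VkQueryPool pool,
                    double& beginNs, double& endNs)
{
    uint64_t ticks[2] = {};
    vkGetQueryPoolResults(device, pool, 0, 2, sizeof(ticks), ticks,
                          sizeof(uint64_t),
                          VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT);

    // timestampPeriod converts ticks to nanoseconds.
    VkPhysicalDeviceProperties props{};
    vkGetPhysicalDeviceProperties(physical, &props);
    beginNs = ticks[0] * static_cast<double>(props.limits.timestampPeriod);
    endNs   = ticks[1] * static_cast<double>(props.limits.timestampPeriod);
}
```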

 

What I hear everywhere is that we should do compute work while rendering shadow maps or early Z - at least that's one example where we need to use multiple queues (besides the obvious data transfers).

Personally I don't think using one queue per render thread makes a lot of sense. After multiple command buffers have been created, we can just use one thread to submit them.

 

A third option would be to write separate rendering pipelines for each card type. Write one solution for NVidia, one for AMD, etc... but this isn't something I'd like to do.

 

I think you should. I'm on AMD only at the moment, but in the past I've seen differences in compute shaders alone between NV/AMD resulting in +/-50% performance. I expect the same applies to almost everything else, especially with a more low-level API :(


Thanks for the input guys.  Another question, about VkCommandPool and thread pools.  Having 1 pool per thread seems pretty straightforward if all the threads are synchronized every frame (create command buffers, synchronize, submit, release buffers, repeat).  But if we want to hang on to some of the command buffers for reuse, things become a little more problematic.

 

- Do you just not reuse command buffers (every command buffer is submitted once and released)?
- Do you reuse command buffers only sparingly, for very specific situations (pre-building them on a single thread or something similar)?
- Do you skip thread pools and use a dedicated/hardcoded threading model instead (i.e. thread 1 does X, thread 2 does Y, etc.)?
- Do you send reused command buffers back to the creating thread for release when done with them, or wrap the command pool in a mutex or some other synchronization?
- Are command pools lightweight enough to have only 1 or 2 command buffers per pool and just pass them as a group between threads?
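For context, the 'straightforward' 1-pool-per-thread setup I described is roughly this (a sketch with made-up names):

```cpp
#include <vulkan/vulkan.h>

// Sketch: each render thread owns one command pool (and one primary command
// buffer) per frame-in-flight. At the start of a frame the pool is reset
// wholesale and the buffer is re-recorded from scratch.
struct ThreadFrameContext {
    VkCommandPool   pool = VK_NULL_HANDLE;
    VkCommandBuffer cmd  = VK_NULL_HANDLE;
};

ThreadFrameContext CreateThreadFrameContext(VkDevice device, uint32_t queueFamily)
{
    ThreadFrameContext ctx;

    VkCommandPoolCreateInfo poolInfo{};
    poolInfo.sType            = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
    poolInfo.flags            = VK_COMMAND_POOL_CREATE_TRANSIENT_BIT;  // buffers are re-recorded often
    poolInfo.queueFamilyIndex = queueFamily;
    vkCreateCommandPool(device, &poolInfo, nullptr, &ctx.pool);

    VkCommandBufferAllocateInfo alloc{};
    alloc.sType              = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
    alloc.commandPool        = ctx.pool;
    alloc.level              = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
    alloc.commandBufferCount = 1;
    vkAllocateCommandBuffers(device, &alloc, &ctx.cmd);
    return ctx;
}

// Called by the owning thread once the fence for this frame slot has signalled.
VkCommandBuffer BeginFrameCommandBuffer(VkDevice device, ThreadFrameContext& ctx)
{
    vkResetCommandPool(device, ctx.pool, 0);  // recycles the buffer's memory in one go

    VkCommandBufferBeginInfo begin{};
    begin.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    begin.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
    vkBeginCommandBuffer(ctx.cmd, &begin);
    return ctx.cmd;
}
```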


I generate per frame command buffers only for debug output.

 

Anything serious is: create command buffers once with indirect dispatch commands, then at runtime fill the dispatch buffer from a compute shader (e.g. one doing frustum culling).

This makes the whole idea of multithreaded rendering obsolete - each frame the CPU only needs to submit the same command buffers, so there is no point in using multiple threads.
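Roughly like this (a sketch - pipelines, descriptor sets and the culling shader are assumed to exist elsewhere; the command buffer is recorded once and only the indirect buffer contents change per frame):

```cpp
#include <vulkan/vulkan.h>

// Sketch: record once, reuse every frame. A culling compute shader writes a
// VkDispatchIndirectCommand into indirectBuffer; the barrier makes that
// write visible to the indirect-argument read.
void RecordPersistentCommandBuffer(VkCommandBuffer cmd,
                                   VkPipeline cullingPipeline,
                                   VkPipeline mainPipeline,
                                   VkBuffer indirectBuffer)
{
    VkCommandBufferBeginInfo begin{};
    begin.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;  // no ONE_TIME_SUBMIT: reused every frame
    vkBeginCommandBuffer(cmd, &begin);

    // Pass 1: GPU culling fills the indirect arguments.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, cullingPipeline);
    vkCmdDispatch(cmd, 64, 1, 1);   // placeholder workgroup count for the culling pass

    VkBufferMemoryBarrier barrier{};
    barrier.sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    barrier.srcAccessMask       = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask       = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.buffer              = indirectBuffer;
    barrier.size                = VK_WHOLE_SIZE;
    vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT, 0,
                         0, nullptr, 1, &barrier, 0, nullptr);

    // Pass 2: the 'real' work reads its dispatch size from the buffer the GPU just wrote.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, mainPipeline);
    vkCmdDispatchIndirect(cmd, indirectBuffer, 0);

    vkEndCommandBuffer(cmd);
}
```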

 

I think this approach can scale up well to a complex graphics engine with very few exceptions.

E.g. if you want to keep occlusion culling on the CPU, each frame this only results in a buffer upload identifying the visible stuff. No new command buffer is necessary.
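I.e. the per-frame CPU work shrinks to something like this (a sketch, assuming a persistently mapped, host-visible/host-coherent buffer that the prerecorded commands already reference):

```cpp
#include <cstring>
#include <cstdint>
#include <vector>

// Sketch: CPU occlusion culling only produces a small visibility array each
// frame; copying it into a persistently mapped buffer is essentially the
// entire per-frame CPU rendering cost - no new command buffers needed.
void UploadVisibility(void* mappedVisibilityBuffer,              // from vkMapMemory, kept mapped
                      const std::vector<uint32_t>& visibleFlags) // 1 = draw, 0 = skip, per object
{
    std::memcpy(mappedVisibilityBuffer, visibleFlags.data(),
                visibleFlags.size() * sizeof(uint32_t));
    // With HOST_COHERENT memory no flush is needed; otherwise call
    // vkFlushMappedMemoryRanges on this range here.
}
```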


On my GTX 970 I have 16 queues with full support, and 1 transfer-only queue. [...]

The big number of 16 does not mean the hardware can natively process, say, 16 different compute tasks simultaneously. AFAIK AMD's async compute is still better (GCN can execute work from different dispatches on one CU, or at least on different CUs; Pascal can only do preemption? Not sure about it). [...]

 

Welcome to the world of hardware differences and the lies they tell :)

So, NV... ugh. NV are basically a massive black box: unless you are an ISV of standing (I guess?) they pretty much don't tell you anything about how the important bits of their hardware work, which is a pain and leads to people trying to figure out wtf is going on when their hardware starts to run slowly.

The first thing we learn is that not all queues are created equal; even if you can create 16 queues which can consume all command types, that doesn't tell you how well the work will execute. The point of contention doesn't in fact seem to be the CUDA or CU cores (NV and AMD respectively) but the front-end dispatcher logic, at least in NV's case.

A bit of simplified GPU theory: your commands, when submitted to the GPU, are dispatched in the order the front-end command processor sees them. This command processor can only keep so many work packets in flight before it has to wait for resources. So, for example, it might be able to keep 10 'draw' packets in flight, but if you submit an 11th, even if you have a CUDA/CU unit free which could do the work, the command processor can't dispatch the work to it until it has free resources to track said work. Also, iirc, work is retired in order: so if 'draw' packet '3' finishes before '1', the command processor can still be blocked from dispatching more work out to the ALU segment of the GPU.

On anything pre-Pascal I would just avoid trying to interleave gfx and compute queue work at all; just go with a single queue, as the hardware seems to have problems keeping work sanely in flight when mixing. By all accounts Pascal does better, but I've not seen many net wins in benchmarks from it (at least, not by a significant amount), so even with Pascal you might want to default to a single queue.
(Pre-emption doesn't really help with this either; that deals with the ability to swap state out so that other work can take over, and it is a heavy operation - CPU-wise it is closer to switching processes in terms of overhead. Pascal's tweak here is the ability to suspend work at instruction boundaries.)

AMD are a lot more open with how their stuff works, which is good for us :)
Basically all AMD GCN-based cards in the wild today have hardware for 1 'graphics queue' (which can consume gfx, compute and copy commands) and 2 DMA queues which can consume copy commands only.
The compute queues are likely both driver- and hardware-dependent however. When Vulkan first appeared my R290 reported only 2 compute queues; it now reports 7. I'm currently not sure how that maps to hardware: while the hardware has 8 'async compute engines' (ACEs), this doesn't mean it is one queue per ACE, as each ACE can itself service 8 'queues' of instructions. (So, in theory, Vulkan on my AMD hardware could report 64 compute-only queues, 8*8.) If it is one queue per ACE then life gets even more fun, because each ACE can maintain two pieces of work at once, meaning you could launch, resources allowing, 14 'compute' jobs + N gfx jobs and not have anything block.

When it comes to using this stuff, certainly in an async manner, it is important to consider what else is going on at the same time.

If your gfx task is bandwidth-heavy but ALU-light then you might have some ALU spare for compute work to take advantage of - but if you pair ALU-heavy with ALU-heavy, or bandwidth-heavy with bandwidth-heavy, you might see a performance dip.

Ultimately the best thing you can do is probably to make your graphics/compute setup data-driven in some way, so you can reconfigure things based on hardware type and factors like screen resolution.

I certainly wouldn't, however, try to drive the hardware from multiple threads into the same window/view - that feels like a terrible idea and a problem waiting to happen. Instead:
- Jobs to build command lists
- Master job(s) to submit built commands in the correct order to the correct queue
That would be my take on the setup.
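As a sketch of that split (the job-system details are hand-waved; the point is only that vkQueueSubmit happens in one place, in a fixed order):

```cpp
#include <vulkan/vulkan.h>
#include <vector>
#include <thread>

// Sketch: worker threads record into their own command pools; a single
// 'master' step gathers the finished command buffers and submits them to
// the queue in the intended order.
void RecordAndSubmitFrame(VkQueue graphicsQueue,
                          std::vector<VkCommandBuffer>& perThreadCmds,
                          VkFence frameFence)
{
    // 1. Build: one job (here: one thread) per command buffer - recording only, no submission.
    std::vector<std::thread> workers;
    for (VkCommandBuffer cmd : perThreadCmds)
        workers.emplace_back([cmd] {
            (void)cmd; // vkBeginCommandBuffer, record draws/dispatches, vkEndCommandBuffer
        });
    for (std::thread& t : workers)
        t.join();

    // 2. Submit: the master thread hands everything to the queue in one call, in order.
    VkSubmitInfo submit{};
    submit.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.commandBufferCount = static_cast<uint32_t>(perThreadCmds.size());
    submit.pCommandBuffers    = perThreadCmds.data();
    vkQueueSubmit(graphicsQueue, 1, &submit, frameFence);
}
```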
 

A third option would be to write separate rendering pipelines for each card type. Write one solution for NVidia, one for AMD, etc... but this isn't something I'd like to do.

 
I think you should. I'm still on AMD only at the moment but in the past i've had some differences just in compute shaders between NV / AMD resulting in +/-50% performance. I expect the same applies to almost anything else especially with a more low level API :(


To a degree you'll have to do this if you want maximal performance; if you don't want max performance I would question your Vulkan usage, more so if you don't want to deal with the hardware awareness which comes with it :)

However, it doesn't have to be too bad if you can design your rendering/compute system in such a way that it is flexible enough to cover the differences - at the simplest level you have a graph describing your gfx/compute work, and on AMD you dispatch it to 2 queues while on NV you serialise it into a single queue in the correct order. (Also, don't forget Intel in all this.)
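A crude version of that data-driven switch, keyed off vendorID (a sketch only - real code would also check the queue families the device actually reports):

```cpp
#include <vulkan/vulkan.h>

struct QueueConfig {
    bool useAsyncCompute  = false;  // second queue for compute work
    bool useTransferQueue = true;   // dedicated DMA queue for uploads
};

// Sketch: pick a default queue layout per vendor, then let data/config files
// override it. PCI vendor IDs: 0x1002 = AMD, 0x10DE = NVIDIA, 0x8086 = Intel.
QueueConfig DefaultQueueConfig(VkPhysicalDevice physical)
{
    VkPhysicalDeviceProperties props{};
    vkGetPhysicalDeviceProperties(physical, &props);

    QueueConfig cfg;
    switch (props.vendorID) {
    case 0x1002: cfg.useAsyncCompute = true;  cfg.useTransferQueue = true;  break; // AMD: gfx + compute + transfer
    case 0x10DE: cfg.useAsyncCompute = false; cfg.useTransferQueue = true;  break; // NV: serialise compute into gfx
    case 0x8086: cfg.useAsyncCompute = false; cfg.useTransferQueue = false; break; // Intel: shared memory, gfx only
    default:     break;
    }
    return cfg;
}
```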

Your shaders and the parameters to them will require tweaking too; AMD prefer small footprints in the 'root' data because they have limited register space to preload, while NV on the other hand are fine with lots of stuff in the 'root' signature of the shaders. You'll also likely want to take advantage of vendor shader extensions for maximal performance on the respective hardware.

Edited by phantom


Now its my understanding that to get the most out of modern GPUs (especially going forward) is that we want to use multiple queue's as much as possible when processing commands that are asynchronous.

I might be a bit out of date with the latest cards, but I thought best practice was to use 1 general + 1 transfer queue on NVidia/Intel (and on Intel you should think twice about utilising the transfer queue), and 1 general + 1 transfer + 1 compute queue on AMD to take advantage of their async compute feature.

Yep, that's a good default position to take - the compute queue of course depends on your workloads even with AMD, but it's a good target to have, as it means spare ALU can be used even when the command processor has no more 'work slots' to hand out. (The general feel is that AMD have fewer 'work slots' than NV on their graphics command processor, which is partly why NV get better performance from a single queue - you leave less ALU on the table by default. Kind of an 'AMD can launch 5, NV can launch 10' thing... I've just made those numbers up, but you get the idea ;) )

I should probably ask someone (either at work or just from AMD) how command queues map to hardware with Vulkan for compute; if everything runs on the same ACE then you hit the 'work slot' limit with two compute workloads in flight at once, but if it is mapped across them then you could have up to 14 independent things going at once. Would be nice to know the balance.

However, as a rule: 1 gfx + 1 compute + 1 transfer on AMD, 1 gfx + 1 transfer on NV, and potentially only the gfx queue on Intel due to shared memory.

And being able to configure your app via data to start up in any of those modes is also a good idea for sanity reasons; i.e. do everything on the gfx queue to make sure synchronisation is sane before trying to introduce a compute queue.
(For the same reason you'll want a 'Fence All The Things!' mode for debug/sanity purposes.)
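Something like this is what I mean by the 'Fence All The Things!' mode (a sketch; the flag would come from your config data):

```cpp
#include <vulkan/vulkan.h>
#include <cassert>

// Sketch: in debug/sanity mode every submit is followed by a full wait, so a
// missing semaphore/barrier shows up immediately instead of as a rare glitch.
void SubmitChecked(VkQueue queue, const VkSubmitInfo& submit, VkFence fence,
                   bool fenceAllTheThings)
{
    VkResult result = vkQueueSubmit(queue, 1, &submit, fence);
    assert(result == VK_SUCCESS);

    if (fenceAllTheThings) {
        // Brutal but effective: serialise the GPU after every submission.
        result = vkQueueWaitIdle(queue);
        assert(result == VK_SUCCESS);
    }
}
```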


Some relevant things covered here: http://gpuopen.com/vulkan-and-doom/

 

I should probably ask someone (either at work or just from AMD) how command queues map to hardware with Vulkan for compute; if everything runs on the same ACE then you hit the 'work slot' limit with two compute workloads in flight at once, but if it is mapped across them then you could have up to 14 independent things going at once. Would be nice to know the balance.

 

Let us know if you hear something... :)

