
Vulkan queues and how to handle them


Recommended Posts

Been playing around with Vulkan a bit lately and got the required 'hello triangle' program working. In all the demos/examples I've seen so far they only use 1 or 2 VkQueues, usually one for graphics/compute and one for transfers. Now it's my understanding that to get the most out of modern GPUs (especially going forward) we want to use multiple queues as much as possible when processing commands that can run asynchronously.

 

On my GTX 970 I have 16 queues with full support, and 1 transfer-only queue. I read somewhere that AMD has only 1 queue with full support, 1-4 compute-only queues, and 2 transfer queues, and Intel GPUs only have 1 queue for everything. So there's definitely a large spread of capabilities, which leaves me a bit unsure how to properly utilize these queues.
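
For reference, this is just what vkGetPhysicalDeviceQueueFamilyProperties reports, so anyone can dump it for their own card; a minimal sketch (physical device selection omitted, the helper name is made up for illustration):

#include <vulkan/vulkan.h>
#include <vector>
#include <cstdio>

// Enumerate the queue families of a physical device and print what each supports.
// Assumes 'physicalDevice' was picked earlier via vkEnumeratePhysicalDevices.
void PrintQueueFamilies(VkPhysicalDevice physicalDevice)
{
    uint32_t familyCount = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount, nullptr);

    std::vector<VkQueueFamilyProperties> families(familyCount);
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount, families.data());

    for (uint32_t i = 0; i < familyCount; ++i)
    {
        const VkQueueFamilyProperties& f = families[i];
        printf("family %u: %u queue(s)%s%s%s\n",
               i, f.queueCount,
               (f.queueFlags & VK_QUEUE_GRAPHICS_BIT) ? " graphics" : "",
               (f.queueFlags & VK_QUEUE_COMPUTE_BIT)  ? " compute"  : "",
               (f.queueFlags & VK_QUEUE_TRANSFER_BIT) ? " transfer" : "");
    }
}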

 

The easiest solution is just to use one queue and be done with it, but this seems like a real waste of Vulkan's capabilities. I guess it's theoretically possible that a GPU/driver could reorder command submissions to obtain some level of parallelization, but I don't know if any actually do. This being Vulkan, my feeling is that that sort of thing would be left to us programmers, and that most Vulkan drivers would tend towards the minimalistic side.

 

The second solution would be to create a unique VkQueue for each rendering thread. Since issuing any commands on a queue requires host synchronization, this maps nicely to a multi-threaded rendering solution. Each thread gets its own queue and its own command pool. Each thread can render, transfer, or whatever, all independently, and the final presentation is then done by the master thread when everything is ready. Seems like a great solution to me, but only the NVidia GPUs have enough extra queues to go around.

 

A third option would be to write separate rendering pipelines for each card type: one solution for NVidia, one for AMD, etc... but this isn't something I'd like to do.

 

A fourth idea was to create a sort of VkQueue pool, where each rendering thread could pull a VkQueue when it needs one, returning it when done. This would work, but it could lead to contention in the case of only 1 queue, undermining any multi-threaded benefit.
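
Something like this is what I have in mind - a rough sketch, assuming all the queues come from one family and were requested at device creation via VkDeviceQueueCreateInfo::queueCount (the class and its names are made up, not from any library):

#include <vulkan/vulkan.h>
#include <mutex>
#include <vector>

// Hypothetical pool: threads borrow a VkQueue and return it when done. With only one
// queue in the family this degenerates to the contention case mentioned above.
class QueuePool
{
public:
    QueuePool(VkDevice device, uint32_t familyIndex, uint32_t queueCount)
    {
        for (uint32_t i = 0; i < queueCount; ++i)
        {
            VkQueue q = VK_NULL_HANDLE;
            vkGetDeviceQueue(device, familyIndex, i, &q);
            m_free.push_back(q);
        }
    }

    VkQueue Acquire()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_free.empty())
            return VK_NULL_HANDLE;   // caller must wait or fall back to a shared queue
        VkQueue q = m_free.back();
        m_free.pop_back();
        return q;
    }

    void Release(VkQueue q)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_free.push_back(q);
    }

private:
    std::mutex m_mutex;
    std::vector<VkQueue> m_free;
};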

 

There are also a half dozen other minor variations of the 4 options above... At this point I'm not really sure how the Vulkan committee envisioned us using the queues. How are you guys handling this situation?


I'm using just a single queue. I think it's fine - if you try and split up a frame's worth of rendering across multiple queues you're going to have a hell of a time synchronising everything between those queues.

 

I think the rule of thumb should be that you'd only use multiple queues in situations where you want the work to be performed at different rates and/or with different priorities. For example, you might be using the GPU to decompress textures on the fly (like Rage, say). The timing of that work has a degree of separation from the main business of rendering a frame, so it might make sense for it to be done on a different queue. I don't think there'd be big gains from spreading a single frame's worth of work across multiple queues.
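
If you do go that route, the only priority control Vulkan gives you is pQueuePriorities at device creation time, and even that is just a hint to the scheduler. A minimal sketch, assuming the chosen family exposes at least two queues (the helper name is made up):

#include <vulkan/vulkan.h>

// Sketch: create a device with two queues from the same family at different
// priorities - e.g. one for frame rendering, one for background texture streaming.
// 'graphicsFamily' is assumed to have queueCount >= 2 (check the family properties first).
VkDevice CreateDeviceWithPriorityQueues(VkPhysicalDevice physicalDevice, uint32_t graphicsFamily)
{
    float priorities[2] = { 1.0f, 0.5f };  // priority is only a hint to the scheduler

    VkDeviceQueueCreateInfo queueInfo = {};
    queueInfo.sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfo.queueFamilyIndex = graphicsFamily;
    queueInfo.queueCount       = 2;
    queueInfo.pQueuePriorities = priorities;

    VkDeviceCreateInfo deviceInfo = {};
    deviceInfo.sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.queueCreateInfoCount = 1;
    deviceInfo.pQueueCreateInfos    = &queueInfo;

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);
    return device;   // retrieve the queues afterwards with vkGetDeviceQueue(device, graphicsFamily, 0/1, ...)
}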

 

"I guess its theoretically possible that a GPU/driver could reorder command submissions to obtain some level of parallelization, but I don't know if any actually do." - Even using a single queue, you're still going to get an enormous amount of parallelization on pretty much every task you perform. The advantage of multiple queues is that it gives the scheduler an option to be clever in the event that there are bubbles that are preventing you from keeping the cores busy. I think multiple queues are slightly analogous to hyperthreading behaviour on CPUs, they give the hardware an opportunity to go off and do something else when one task is stalled, to stretch the analogy, imagine your GPU is like a CPU with 128 cores, it's not like using a single queue will utilize only a single core, it's more like you're using all 128 cores, but have disabled the hyperthreading on them.

 

Edit: This might be useful: http://stackoverflow.com/questions/37575012/should-i-try-to-use-as-many-queues-as-possible

Edited by C0lumbo


On my GTX 970 I have 16 queues with full support, and 1 transfer-only queue. I read somewhere that AMD has only 1 queue with full support, 1-4 compute-only queues, and 2 transfer queues, and Intel GPUs only have 1 queue for everything.

 

AMD has 1 graphics/compute queue and 3 compute-only queues + 2 transfer queues - you can look up this stuff for various GPUs here: http://vulkan.gpuinfo.org/displayreport.php?id=700#queuefamilies

Early drivers supported 8 compute queues, so this may change.

 

The big number of 16 does not mean there is native hardware support for, e.g., processing 16 different compute tasks simultaneously.

AFAIK AMD's async compute is still better (GCN can execute work from different dispatches on one CU, or at least on different CUs; Pascal can only do preemption? Not sure about that).

I did some experiments with async compute using multiple queues on AMD, but it was just a small loss, probably because my shaders were totally saturating the GPU, so there was no point in running them async.

But I noticed that their execution start and end timestamps overlap, so at least it works.

I will repeat this with less demanding shaders later - I guess it becomes a win then.

 

What I hear everywhere is that we should do compute work while rendering shadow maps or early Z - that's at least one example where we need to use multiple queues (besides the obvious data transfers).
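
For reference, the usual way to express that overlap is a second queue plus a semaphore between the two submits - a minimal sketch, assuming everything is created elsewhere (and note that if the queues come from different families, resources created with exclusive sharing mode also need a queue family ownership transfer, or VK_SHARING_MODE_CONCURRENT):

#include <vulkan/vulkan.h>

// Sketch of the "compute alongside shadow maps" pattern: the compute queue signals a
// semaphore, and the graphics submit that consumes the results waits on it.
void SubmitAsyncCompute(VkQueue computeQueue, VkQueue graphicsQueue,
                        VkCommandBuffer computeCmd, VkCommandBuffer graphicsCmd,
                        VkSemaphore computeDone, VkFence frameFence)
{
    // 1) Kick the compute work on its own queue.
    VkSubmitInfo computeSubmit = {};
    computeSubmit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    computeSubmit.commandBufferCount   = 1;
    computeSubmit.pCommandBuffers      = &computeCmd;
    computeSubmit.signalSemaphoreCount = 1;
    computeSubmit.pSignalSemaphores    = &computeDone;
    vkQueueSubmit(computeQueue, 1, &computeSubmit, VK_NULL_HANDLE);

    // 2) Graphics work that reads the compute output waits for the semaphore.
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_SHADER_BIT; // first stage that reads the results
    VkSubmitInfo gfxSubmit = {};
    gfxSubmit.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    gfxSubmit.waitSemaphoreCount = 1;
    gfxSubmit.pWaitSemaphores    = &computeDone;
    gfxSubmit.pWaitDstStageMask  = &waitStage;
    gfxSubmit.commandBufferCount = 1;
    gfxSubmit.pCommandBuffers    = &graphicsCmd;
    vkQueueSubmit(graphicsQueue, 1, &gfxSubmit, frameFence);
}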

Personally I don't think using one queue per render thread makes a lot of sense. After multiple command buffers have been created, we can just use one thread to submit them.

 

A third option would be to write separate rendering pipelines for each card type. Write one solution for NVidia, one for AMD, etc... but this isn't something I'd like to do.

 

I think you should. I'm still on AMD only at the moment, but in the past I've had differences just in compute shaders between NV / AMD resulting in +/-50% performance. I expect the same applies to almost anything else, especially with a lower-level API :(


Thanks for the input guys.  Another question about VkCommandPool and thread pools.  Having 1 pool per thread seems pretty straightforward if all the threads are synchronized every frame (create command buffers, synchronize, submit, release buffers, repeat).  But if we want to hang on to some of the command buffers for reuse, things become a little more problematic.

 

• Do you just not reuse command buffers at all (every command buffer is submit-once and release)?
• Do you reuse command buffers only sparingly and for very specific situations (e.g. a single thread pre-builds them or something similar)?
• Do you skip thread pools and instead use a dedicated/hardcoded threading model (i.e. thread 1 does X, thread 2 does Y, etc.)?
• Do you send reused command buffers (when done with them) back to the creating thread for release, or wrap the command pool in a mutex or some other synchronization?
• Are command pools lightweight enough to have 1 or 2 command buffers per pool, and just pass them as a group between threads?
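
For context, the simplest scheme I can come up with is one pool per thread per frame-in-flight, resetting the whole pool once that frame's fence has signalled, so nothing ever has to be handed between threads - a rough sketch (struct and function names made up):

#include <vulkan/vulkan.h>

// Each worker thread owns one of these per frame-in-flight; the whole pool is reset
// and the buffer re-recorded every frame instead of reusing recorded contents.
struct ThreadFrameCommands
{
    VkCommandPool   pool = VK_NULL_HANDLE;
    VkCommandBuffer cmd  = VK_NULL_HANDLE;
};

ThreadFrameCommands CreateThreadFrameCommands(VkDevice device, uint32_t queueFamily)
{
    ThreadFrameCommands out;

    VkCommandPoolCreateInfo poolInfo = {};
    poolInfo.sType            = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
    poolInfo.flags            = VK_COMMAND_POOL_CREATE_TRANSIENT_BIT; // short-lived, re-recorded every frame
    poolInfo.queueFamilyIndex = queueFamily;
    vkCreateCommandPool(device, &poolInfo, nullptr, &out.pool);

    VkCommandBufferAllocateInfo allocInfo = {};
    allocInfo.sType              = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
    allocInfo.commandPool        = out.pool;
    allocInfo.level              = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
    allocInfo.commandBufferCount = 1;
    vkAllocateCommandBuffers(device, &allocInfo, &out.cmd);
    return out;
}

// Called at the start of the frame, after waiting on that frame's fence.
void BeginThreadFrame(VkDevice device, ThreadFrameCommands& tfc)
{
    vkResetCommandPool(device, tfc.pool, 0);  // recycles the buffer's memory in one call

    VkCommandBufferBeginInfo beginInfo = {};
    beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    beginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
    vkBeginCommandBuffer(tfc.cmd, &beginInfo);
}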

Share this post


Link to post
Share on other sites

I generate per frame command buffers only for debug output.

 

Anything serious is: create command buffers once with indirect dispatch commands, and at runtime fill the dispatch buffer from a compute shader (e.g. one doing frustum culling).

This makes the whole idea of multithreaded rendering obsolete - per frame the CPU only needs to submit the same command buffers, so there is no point in using multiple threads.

 

I think this approach can scale up well to a complex graphics engine with very few exceptions.

E.g. if you want to keep occlusion culling on CPU, per frame this only results in a buffer upload to identify visible stuff. No new command buffer is necessary.
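
To make that concrete, a rough sketch of what one of those prebuilt command buffers might contain - a culling dispatch that writes a VkDispatchIndirectCommand, a barrier, then the indirect dispatch itself (pipelines, descriptor sets and the argument buffer are assumed to be created and bound elsewhere):

#include <vulkan/vulkan.h>

// Recorded once and resubmitted every frame; what actually runs is decided by whatever
// the culling shader wrote into 'dispatchArgs' (a buffer of VkDispatchIndirectCommand,
// created with VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT).
void RecordIndirectWork(VkCommandBuffer cmd,
                        VkPipeline cullPipeline, VkPipeline workPipeline,
                        VkBuffer dispatchArgs)
{
    // 1) Culling pass writes the dispatch arguments (group count is a placeholder).
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, cullPipeline);
    vkCmdDispatch(cmd, 1, 1, 1);

    // 2) Make the written arguments visible to the indirect command read.
    VkMemoryBarrier barrier = {};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
                         0, 1, &barrier, 0, nullptr, 0, nullptr);

    // 3) The actual work, sized by the GPU rather than the CPU.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, workPipeline);
    vkCmdDispatchIndirect(cmd, dispatchArgs, 0);
}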



Welcome to the world of hardware differences and the lies they tell :)

So, NV.. ugh.. NV are basically a massive black box because unless you are an ISV of standing (I guess?) they pretty much don't tell you anything about how the important bits of their hardware work, which is a pain and leads to people trying to figure out wtf is going on when their hardware starts to run slowly.

The first thing we learn is that not all queues are created equal; even if you can create 16 queues that can consume all the command types, this doesn't tell you how well they will execute. The point of contention doesn't in fact seem to be the CUDA or CU cores (NV and AMD respectively) but the front-end dispatcher logic, at least in NV's case.

A bit of simplified GPU theory: your commands, when submitted to the GPU, are dispatched in the order the front-end command processor sees them. This command processor can only keep so many work packets in flight before it has to wait for resources. So, for example, it might be able to keep 10 'draw' packets in flight, but if you submit an 11th, even if you have a CUDA/CU unit free which could do the work, the command processor can't dispatch the work to it until it has free resources to track said work. Also, IIRC, work is retired in order: so if 'draw' packet '3' finishes before '1', the command processor can still be blocked from dispatching more work out to the ALU segment of the GPU.

On anything pre-Pascal I would just avoid trying to interleave gfx and compute queue work at all; just go with a single queue, as the hardware seems to have problems keeping work sanely in flight when mixing. By all accounts Pascal seems to do better, but I've not seen many net wins in benchmarks from it (at least, not by a significant amount), so even with Pascal you might want to default to a single queue.
(Pre-emption doesn't really help with this either; that deals with the ability to swap state out so that other work can take over, and it is a heavy operation; CPU-wise it is closer to switching processes in the amount of overhead. Pascal's tweak here is the ability to suspend work at instruction boundaries.)

AMD are a lot more open with how their stuff works, which is good for us :)
Basically all AMD GCN based cards in the wild today will have hardware for 1 'Graphics queue' (which can consume gfx, compute and copy commands) and 2 DMA queues which can consume copy commands only.
The compute queues are likely both driver and hardware dependent, however. When Vulkan first appeared my R290 reported back only 2 compute queues; it now reports 7. However, I'm currently not sure how that maps to hardware; while the hardware has 8 'async compute units', this doesn't mean it is one queue per ACE, as each ACE can service 8 'queues' of instructions itself. (So, in theory, Vulkan on my AMD hardware could report 64 compute-only queues, 8*8.) If it is one queue per ACE then life gets even more fun, because each ACE can maintain two pieces of work at once, meaning you could launch, resources allowing, 14 'compute' jobs + N gfx jobs and not have anything block.

When it comes to using this stuff, certainly in an async manner, it is important to consider what is going on at the same time.

If your gfx task is bandwidth heavy but ALU light then you might have some ALU spare for some compute work to take advantage of - but if you pair ALU heavy with ALU heavy, or bandwidth heavy with bandwidth heavy you might see a performance dip.

Ultimately the best thing you can do is probably to make your graphics/compute setup data driven in some way so you can reconfigure things based on hardware type and factors like resolution of the screen etc.

I certainly, however, wouldn't try to drive the hardware from multiple threads into the same window/view - that feels like a terrible idea and a problem waiting to happen.
- Jobs to build command lists
- Master job(s) to submit built commands in the correct order to the correct queue
That would be my take on the setup.
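
As a sketch of the master end of that split: the worker jobs record their command buffers, and submission is then a single vkQueueSubmit whose pCommandBuffers array is already in frame order (the semaphore and fence names below are just placeholders for whatever your swapchain/frame code uses):

#include <vulkan/vulkan.h>
#include <vector>

// One "master" job submits everything the workers recorded, in order, in one call.
// The order inside pCommandBuffers is the submission order.
void MasterSubmit(VkQueue queue,
                  const std::vector<VkCommandBuffer>& recordedInOrder,
                  VkSemaphore imageAvailable, VkSemaphore renderFinished,
                  VkFence frameFence)
{
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

    VkSubmitInfo submit = {};
    submit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submit.waitSemaphoreCount   = 1;
    submit.pWaitSemaphores      = &imageAvailable;      // from vkAcquireNextImageKHR
    submit.pWaitDstStageMask    = &waitStage;
    submit.commandBufferCount   = static_cast<uint32_t>(recordedInOrder.size());
    submit.pCommandBuffers      = recordedInOrder.data();
    submit.signalSemaphoreCount = 1;
    submit.pSignalSemaphores    = &renderFinished;      // vkQueuePresentKHR waits on this
    vkQueueSubmit(queue, 1, &submit, frameFence);
}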
 

A third option would be to write separate rendering pipelines for each card type. Write one solution for NVidia, one for AMD, etc... but this isn't something I'd like to do.

 
I think you should. I'm still on AMD only at the moment, but in the past I've had differences just in compute shaders between NV / AMD resulting in +/-50% performance. I expect the same applies to almost anything else, especially with a lower-level API :(


To a degree you'll have to do this if you want maximal performance; if you don't want max performance I would question your Vulkan usage, more so if you don't want to deal with the hardware awareness which comes with it :)

However, it doesn't have to be too bad if you can design your rendering/compute system in such a way that it is flexible enough to cover the differences - at the simplest level you have a graph which describes your gfx/compute work, and on AMD you dispatch it to 2 queues while on NV you serialise it into a single queue in the correct order. (Also, don't forget Intel in all this.)

Your shaders and the parameters to them will require tweaking too; AMD prefer small footprints in the 'root' data because they have limited register space to preload. NV, on the other hand, are fine with lots of stuff in the 'root' signature of the shaders. You'll also likely want to take advantage of shader extensions for maximal performance on the respective hardware.

Edited by phantom


Now it's my understanding that to get the most out of modern GPUs (especially going forward) we want to use multiple queues as much as possible when processing commands that can run asynchronously.

I might be a bit out of date with the latest cards, but I thought best practice was to use 1 general + 1 transfer queue on NVidia/Intel (and on Intel you should think twice about utilising the transfer queue), and 1 general + 1 transfer + 1 compute on AMD to take advantage of their async compute feature.
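
A sketch of picking that layout from whatever the driver actually exposes, rather than hard-coding it per vendor (the struct and function names are made up; any index left as UINT32_MAX just means "fall back to the general queue"):

#include <vulkan/vulkan.h>
#include <vector>
#include <cstdint>

// One general (graphics+compute) family, plus a dedicated compute family and a
// dedicated transfer family when they exist (they do on AMD, usually not on Intel).
struct QueueFamilyChoice
{
    uint32_t general  = UINT32_MAX;
    uint32_t compute  = UINT32_MAX;  // compute-capable but not graphics
    uint32_t transfer = UINT32_MAX;  // transfer-only
};

QueueFamilyChoice PickQueueFamilies(VkPhysicalDevice physicalDevice)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &count, families.data());

    QueueFamilyChoice choice;
    for (uint32_t i = 0; i < count; ++i)
    {
        VkQueueFlags flags = families[i].queueFlags;
        if ((flags & VK_QUEUE_GRAPHICS_BIT) && choice.general == UINT32_MAX)
            choice.general = i;
        else if ((flags & VK_QUEUE_COMPUTE_BIT) && !(flags & VK_QUEUE_GRAPHICS_BIT) &&
                 choice.compute == UINT32_MAX)
            choice.compute = i;
        else if ((flags & VK_QUEUE_TRANSFER_BIT) &&
                 !(flags & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT)) &&
                 choice.transfer == UINT32_MAX)
            choice.transfer = i;
    }
    return choice;
}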

Yep, that's a good default position to take - compute of course depends on your workloads even with AMD, but it's a good target to have, as it means the spare ALU can be used even when the command processor has no more 'work slots' to hand out. (The general feel is that AMD have fewer 'work slots' than NV on their graphics command processor, which is partly why NV have better performance with a single queue, as you leave less ALU on the table by default - kind of an "AMD can launch 5, NV can launch 10" thing... I've just made those numbers up, but you get the idea ;) )

I should probably ask someone (either at work or just from AMD) how command queues map to hardware with Vulkan for compute; if it all runs on the same ACE then you hit the 'work slot' limit with two compute workloads in flight at once, but if it is mapped across them then you could have up to 14 independent things going at once. Would be nice to know the balance.

However, as a rule: 1-1-1 on AMD, 1-1 on NV, and potentially gfx only for Intel due to shared memory.

And being able to configure your app via data to start up in any of those modes is also a good idea for sanity reasons; i.e. do everything on the gfx queue to make sure things are sane with synchronisation before trying to introduce a compute queue.
(Same reason you'll want a 'Fence All The Things!' mode for debug/sanity reasons.)
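
A tiny sketch of what that data-driven switch could look like - every submit path asks one function for its queue, so a config value can force single-queue (or fence-everything) mode while debugging (all names here are hypothetical):

#include <vulkan/vulkan.h>

// Hypothetical config read from data at startup.
enum class QueueMode { SingleQueue, SplitComputeQueue };

struct RendererConfig
{
    QueueMode queueMode       = QueueMode::SingleQueue;  // safe default for bring-up
    bool      fenceAllSubmits = false;                   // "Fence All The Things!" debug mode
};

// Central choke point: single-queue mode routes everything to the graphics queue.
VkQueue GetQueueFor(bool isComputeWork,
                    const RendererConfig& cfg,
                    VkQueue graphicsQueue, VkQueue computeQueue)
{
    if (cfg.queueMode == QueueMode::SplitComputeQueue &&
        isComputeWork && computeQueue != VK_NULL_HANDLE)
        return computeQueue;
    return graphicsQueue;
}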


Some relevant things covered here: http://gpuopen.com/vulkan-and-doom/

 

I should probably ask someone (either at work or just from AMD) how command queues map to hardware with Vulkan for compute; if it all runs on the same ACE then you hit the 'work slot' limit with two compute workloads in flight at once, but if it is mapped across them then you could have up to 14 independent things going at once. Would be nice to know the balance.

 

Let us know if you hear something... :)
