• ### Similar Content

• I have pretty good experience with multi-GPU programming in D3D12. Now, looking at Vulkan, although there are a few similarities, I cannot wrap my head around a few things due to the extremely sparse documentation (typical Khronos...).
In D3D12 -> You create a resource on GPU0 that is visible to GPU1 by setting the VisibleNodeMask to 00000011, where the last two bits being set means it is visible to GPU0 and GPU1.
In Vulkan - I can see there is the VkBindImageMemoryDeviceGroupInfoKHR struct, which you add to the pNext chain of VkBindImageMemoryInfoKHR before calling vkBindImageMemory2KHR. You also set the device indices, which I assume serve the same purpose as the VisibleNodeMask, except instead of a mask it is an array of indices. So far, so good.
Let's look at a typical SFR scenario:  Render left eye using GPU0 and right eye using GPU1
You have two textures. pTextureLeft is exclusive to GPU0, and pTextureRight is created on GPU1 but is visible to GPU0 so it can be sampled from GPU0 when we want to draw it to the swapchain. This is in the D3D12 world. How do I map this to Vulkan? Do I just set the device indices for pTextureRight to { 0, 1 }?
Now comes the command buffer submission part that is even more confusing.
There is the struct VkDeviceGroupCommandBufferBeginInfoKHR. It accepts a device mask which I understand is similar to creating a command list with a certain NodeMask in D3D12.
So for GPU1 -> Since I am only rendering to pTextureRight, I need to set the device mask to 2 (00000010)?
For GPU0 -> Since I only render to pTextureLeft and finally sample pTextureLeft and pTextureRight to render to the swapchain, I need to set the device mask to 1 (00000001)?
The same applies to VkDeviceGroupSubmitInfoKHR?
Now the fun part: it does not work. Both command buffers render to their textures correctly; I verified this by reading the textures back and storing them as PNGs. The left texture is sampled correctly in the final composite pass, but I get black in the area where the right texture should appear. Is there something I am missing? Here is a code snippet too:
```cpp
void Init()
{
    RenderTargetInfo info = {};
    info.pDeviceIndices = { 0, 0 };
    CreateRenderTarget(&info, &pTextureLeft);

    // Need to share this on both GPUs
    info.pDeviceIndices = { 0, 1 };
    CreateRenderTarget(&info, &pTextureRight);
}

void DrawEye(CommandBuffer* pCmd, uint32_t eye)
{
    // Begin with device mask depending on eye
    pCmd->Open(1 << eye);

    // Do the draw

    // If eye is 0, we need to do some extra work to composite
    // pTextureRight and pTextureLeft
    if (eye == 0)
    {
        DrawTexture(0,           0, width * 0.5, height, pTextureLeft);
        DrawTexture(width * 0.5, 0, width * 0.5, height, pTextureRight);
    }

    // Submit to the correct GPU
    pQueue->Submit(pCmd, 1 << eye);
}

void Draw()
{
    DrawEye(pRightCmd, 1);
    DrawEye(pLeftCmd, 0);
}
```
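A hedged sketch of how this maps onto the raw device-group structs (KHR suffixes are dropped in Vulkan 1.1; `eye`, `cmd`, `queue`, and `fence` are illustrative). One thing worth double-checking: in VkBindImageMemoryDeviceGroupInfo, pDeviceIndices[i] names the memory instance that physical device i binds to, so { 0, 1 } gives each GPU its own local copy rather than letting GPU0 see GPU1's texture; for peer access GPU0 would need to bind to GPU1's instance, and the heap must report support via vkGetDeviceGroupPeerMemoryFeatures. That mismatch is one plausible cause of the black region.

```cpp
// Sketch: record and submit a command buffer for one GPU of the group.
// eye 0 -> deviceMask 0b01 (GPU0), eye 1 -> deviceMask 0b10 (GPU1).
VkDeviceGroupCommandBufferBeginInfo groupBegin{};
groupBegin.sType      = VK_STRUCTURE_TYPE_DEVICE_GROUP_COMMAND_BUFFER_BEGIN_INFO;
groupBegin.deviceMask = 1u << eye;

VkCommandBufferBeginInfo beginInfo{};
beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
beginInfo.pNext = &groupBegin;
vkBeginCommandBuffer(cmd, &beginInfo);
// ... record the eye's draws ...
vkEndCommandBuffer(cmd);

// The same mask is repeated at submit time.
const uint32_t cmdMask = 1u << eye;
VkDeviceGroupSubmitInfo groupSubmit{};
groupSubmit.sType                     = VK_STRUCTURE_TYPE_DEVICE_GROUP_SUBMIT_INFO;
groupSubmit.commandBufferCount        = 1;
groupSubmit.pCommandBufferDeviceMasks = &cmdMask;

VkSubmitInfo submit{};
submit.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit.pNext              = &groupSubmit;
submit.commandBufferCount = 1;
submit.pCommandBuffers    = &cmd;
vkQueueSubmit(queue, 1, &submit, fence);
```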

• Hi,
I finally managed to get the DX11-emulating Vulkan device working, but everything is flipped vertically now because Vulkan has a different clip space. What are the best practices out there to keep these implementations consistent? I tried using a vertically flipped viewport, and while it works on an Nvidia 1050, the Vulkan debug layer is throwing errors that this is not supported by the spec, so it might not work elsewhere. There is also the possibility of flipping the clip-space Y coordinate before writing it out from the vertex shader, but that requires changing and recompiling every shader. I could also bake it into the camera projection matrices, though I want to avoid that because then I would need to track down everywhere in the engine where matrices are uploaded... Any chance of an easy extension or something? If not, I will probably go with changing the vertex shaders.
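For reference, a sketch of the flipped-viewport approach: with VK_KHR_maintenance1 (promoted to core in Vulkan 1.1) a negative viewport height is legal and flips Y without touching any shaders, which is likely why the validation layer complains when that extension is not enabled (`cmd`, `width`, and `height` are illustrative):

```cpp
// Flip Y by pointing the viewport at the bottom edge and giving it a
// negative height. Requires VK_KHR_maintenance1 or a Vulkan 1.1 device;
// without it, validation correctly flags height < 0 as out of spec.
VkViewport viewport{};
viewport.x        = 0.0f;
viewport.y        = static_cast<float>(height);  // origin moved to the bottom
viewport.width    = static_cast<float>(width);
viewport.height   = -static_cast<float>(height); // negative height = flipped Y
viewport.minDepth = 0.0f;
viewport.maxDepth = 1.0f;
vkCmdSetViewport(cmd, 0, 1, &viewport);
```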

• I am publishing our ray tracing engine and products, built on graphics APIs (C++, Vulkan API, GLSL 4.60, SPIR-V), for production use: https://github.com/world8th/satellite-oem
For end users I have no other products or test products yet. I also have one simple glTF viewer example (source code only).
In 2016 I had the idea of replacing screen-space reflections, but in 2018 we decided to finally re-profile the project as the "basis of a render engine". In Q3 2017 it was finally merged to the Vulkan API.

• vkQueuePresentKHR is busy-waiting, i.e. wasting CPU cycles while waiting for vsync. Expected, sane behavior would of course be something akin to Sleep(0) until it can finish.
Windows 7, GeForce GTX 660.
Is this a common problem? Is there anything I can do to make it behave properly?

I am working on reusing as many command buffers as I can by pre-recording them at load time. This gives a significant CPU boost, although now I cannot get the GPU timestamps since there is no point at which to read them back. I map the readback buffer before reading and unmap it after reading is done. Does this mean I need a persistently mapped readback buffer?
```cpp
void Init()
{
    beginCmd(cmd);
    cmdBeginQuery(cmd);
    // Do a bunch of stuff
    cmdEndQuery(cmd);
    endCmd(cmd);
}

void Draw()
{
    CommandBuffer* cmd = commands[frameIdx];
    submit(cmd);
}
```
The begin and end query do exactly what the names say.
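The persistently mapped buffer is indeed the usual answer: host-visible memory may legally stay mapped while the GPU writes to it, and the pre-recorded command buffer can copy the query results into it every time it executes. A sketch under those assumptions (names are illustrative; assumes HOST_COHERENT memory, otherwise add vkInvalidateMappedMemoryRanges before reading):

```cpp
// At init: map once and keep the pointer for the buffer's lifetime.
uint64_t* timestamps = nullptr;
vkMapMemory(device, readbackMemory, 0, VK_WHOLE_SIZE, 0, (void**)&timestamps);

// Inside the pre-recorded command buffer:
vkCmdResetQueryPool(cmd, queryPool, 0, 2);
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, queryPool, 0);
// ... the actual work ...
vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, queryPool, 1);
vkCmdCopyQueryPoolResults(cmd, queryPool, 0, 2, readbackBuffer, 0,
                          sizeof(uint64_t),
                          VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT);

// Each frame on the CPU, after that frame's fence has signaled
// (timestampPeriod comes from VkPhysicalDeviceLimits::timestampPeriod):
double gpuMs = (timestamps[1] - timestamps[0]) * timestampPeriod * 1e-6;
```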

# Vulkan Queues


## Recommended Posts

I'm working my way through some Vulkan examples, and I just want to make sure that my understanding of queues is correct.

Each physical device has a set $$Q$$ of queue families, and each queue family $$q \in Q$$ has some number $$N_q$$ of queues that can be used. When creating a logical device, you specify how many logical queues you want, such that for each queue family $$q$$ the sum of the queueCount properties over all VkDeviceQueueCreateInfo entries for $$q$$ is not greater than $$N_q$$ (maybe some unforeseen circumstance has led us to a VkDeviceQueueCreateInfo array of [(family=0,cnt=2), (family=1,cnt=1), (family=0,cnt=1)]; note, though, that the spec requires each queueFamilyIndex to appear only once in that array, so the two family=0 entries would have to be merged). The driver takes care of multiplexing the queues: e.g. if I create two logical devices with 8 queues each from the same queue family, whether the driver assigns both sets of logical queues $$0, \dots, 7$$ to physical queues $$0, \dots, 7$$ or to $$0, \dots, 15$$ is none of my concern; the plumbing is all handled by the driver.
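That per-family bookkeeping is easy to get wrong, so here is a tiny self-contained helper (hypothetical, not a real Vulkan call) that checks a list of (family, queueCount) requests against the per-family limits you would read out of queueFamilyProperties:

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Returns true if, for every family q, the summed queueCount of all
// requests targeting q stays within that family's limit N_q.
inline bool requestsFit(
    const std::vector<std::pair<uint32_t, uint32_t>>& requests, // (family, count)
    const std::vector<uint32_t>& familyLimits)                  // N_q per family
{
    std::map<uint32_t, uint32_t> perFamily;
    for (const auto& r : requests)
        perFamily[r.first] += r.second;

    for (const auto& [family, total] : perFamily) {
        if (family >= familyLimits.size()) return false; // unknown family
        if (total > familyLimits[family])  return false; // over-subscribed
    }
    return true;
}
```

For example, with limits {16, 1} the array [(0,2), (1,1), (0,1)] fits, since family 0 totals 3 of 16 and family 1 totals 1 of 1.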

Different queues can be submitted to in parallel, but extra care must be taken to make sure the command buffers don't interfere with each other if they interact with the same object. Queues are retrieved in the order they were created, with an index local to each family: e.g. if my VkDeviceQueueCreateInfo array looked like [(cnt=2,priorities=[1.0,0.5]), (cnt=1,priorities=[0.2])] for two different families, I can expect vkGetDeviceQueue to return queues with priorities 1.0 and 0.5 for indices 0 and 1 of the first family, and 0.2 for index 0 of the second. Queue priorities are relative numbers, such that the following metaphor makes sense: if each queue is represented as a thread, a priority of 1.0 means the thread should work as hard as it possibly can, while a priority of 0.5 means it should only work half as hard, with the union of all threads representing the queue processing power of the entire physical device.
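As a sketch, retrieval then looks like this, assuming the two create infos above target two different families (`gfxFamily` and `computeFamily` are illustrative names):

```cpp
// Queue handles are fetched per family, with an index local to that
// family's VkDeviceQueueCreateInfo entry.
VkQueue gfxHigh  = VK_NULL_HANDLE;
VkQueue gfxLow   = VK_NULL_HANDLE;
VkQueue computeQ = VK_NULL_HANDLE;
vkGetDeviceQueue(device, gfxFamily,     0, &gfxHigh);   // priority 1.0
vkGetDeviceQueue(device, gfxFamily,     1, &gfxLow);    // priority 0.5
vkGetDeviceQueue(device, computeFamily, 0, &computeQ);  // priority 0.2
```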

If I said something wrong, please feel free to correct me. I want to make sure I'm not misunderstanding something fundamental.

##### Share on other sites

if I create two logical devices with 8 queues each from the same queue family

Create multiple devices from one physical device? Interesting idea. Are there any possible advantages? Why would you do this?

I can't spot anything wrong with what you say. I'm no expert, but I can share some experience:

I tried various numbers for the priorities for async compute on AMD, but IIRC the effect was either not measurable or a slight loss, so I ended up using 1.0 for everything. Needs more testing.

Looking at my log, we can be quite sure that Vulkan queues do not map to hardware queues in any way:
```
found GPUs: 2

deviceName: GTX 670
apiVersion: 4194328
driverVersion: 1577369600
Queue family 0 (16 queues): graphics: 1 compute: 1 transfer: 1 sparse: 1
Queue family 1 (1 queues):  graphics: 0 compute: 0 transfer: 1 sparse: 0

deviceName: AMD Radeon (TM) R9 Fury Series
apiVersion: 4194341
driverVersion: 4210689
Queue family 0 (1 queues):  graphics: 1 compute: 1 transfer: 1 sparse: 1
Queue family 1 (3 queues):  graphics: 0 compute: 1 transfer: 1 sparse: 1
Queue family 2 (2 queues):  graphics: 0 compute: 0 transfer: 1 sparse: 1
```

The AMD card exposes fewer queues in Vulkan than it has in hardware, while the NV card exposes lots of Vulkan queues but has no async support in hardware at all.
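A log like the one above comes straight from vkGetPhysicalDeviceQueueFamilyProperties; a minimal dump loop looks roughly like this (`physicalDevice` is assumed to be already enumerated):

```cpp
#include <cstdio>
#include <vector>

uint32_t familyCount = 0;
vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount, nullptr);
std::vector<VkQueueFamilyProperties> families(familyCount);
vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount,
                                         families.data());

for (uint32_t i = 0; i < familyCount; ++i) {
    const VkQueueFlags f = families[i].queueFlags;
    std::printf("Queue family %u (%u queues): graphics: %d compute: %d "
                "transfer: %d sparse: %d\n",
                i, families[i].queueCount,
                (f & VK_QUEUE_GRAPHICS_BIT) != 0,
                (f & VK_QUEUE_COMPUTE_BIT) != 0,
                (f & VK_QUEUE_TRANSFER_BIT) != 0,
                (f & VK_QUEUE_SPARSE_BINDING_BIT) != 0);
}
```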

Interesting results I got from AMD:
* Doing graphics and compute in the same family-0 queue is faster than using a second compute queue from family 1 (when still doing everything sequentially; I did not try graphics and compute async).
* Async compute requires using multiple queues and command buffers, but even without synchronization there is some gap between command buffer executions; it may well be large enough to eliminate the advantage of async :(

This gap between multiple command buffers happens also with only a single queue, and also on Nvidia.

Conclusion: You need a very good reason to use multiple queues / multiple command buffers.

(I work only on compute, so I can't say anything about graphics or whether it makes a difference.)

##### Share on other sites

Create multiple devices from one physical device? Interesting idea. Are there any possible advantages? Why would you do this?

This is a what-if scenario; there's no real reason behind it.

Thanks for sharing your experiences though. I guess I'll stick with just creating one queue for now.