
# Vulkan Changing a descriptor set's buffer memory every frame?


Following problem:
I have a bunch of meshes that need to be rendered in one batch (They're not the same, so I can't use instancing).
I've created a secondary command buffer, which does exactly that:

(PseudoCode)
// Recorded once, then replayed every frame from the primary command buffer.
VkCommandBuffer cmdSec = new SecondaryCommandBuffer;
int subPass = 0;
vkBeginCommandBuffer(cmdSec,VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT,renderPass,framebuffer,subPass);
vkCmdBindPipeline(cmdSec,pipeline);
foreach(mesh) {
    vkCmdBindVertexBuffers(cmdSec,...);
    vkCmdDraw(cmdSec,...);
}
vkEndCommandBuffer(cmdSec);


The secondary command buffer is later executed each frame from within the primary command buffer:
VkCommandBuffer cmdPrim = new PrimaryCommandBuffer;
vkCmdBeginRenderPass(cmdPrim,renderPass,framebuffer,VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS);
vkCmdExecuteCommands(cmdPrim,cmdSec);
vkCmdEndRenderPass(cmdPrim);

So far so good. The problem is, to render the meshes, I also need to push some additional data (e.g. matrix) to the pipeline, and this data changes every frame.
Push constants are not an option, since they can't be used in a render pass with the VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS flag:

The contents parameter describes how the commands in the first subpass will be provided. If it is VK_SUBPASS_CONTENTS_INLINE, the contents of the subpass will be recorded inline in the primary command buffer, and calling a secondary command buffer within the subpass is an error. If contents is VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS, the contents are recorded in secondary command buffers that will be called from the primary command buffer, and vkCmdExecuteCommands is the only valid command on the command buffer until vkCmdNextSubpass or vkCmdEndRenderPass.

(Source: https://www.khronos.org/registry/vulkan/specs/1.0/apispec.html#vkCmdBeginRenderPass)

That means my only(?) option is to use a descriptor set.
The idea is to bind the descriptor set inside the secondary command buffer recording, then update the descriptor set with the new data every frame, right before executing the secondary command buffer.

Now, I'm still new at this, so I'd like someone to confirm whether this is correct or not. There's a couple of things I have to take into account:
• Since the memory of the descriptor set's buffer changes every frame (=non-coherent) it has to be created without the VK_MEMORY_PROPERTY_HOST_COHERENT_BIT flag.

vkFlushMappedMemoryRanges must be used to guarantee that host writes to non-coherent memory are visible to the device. It must be called after the host writes to non-coherent memory have completed and before command buffers that will read or write any of those memory locations are submitted to a queue.

• vkFlushMappedMemoryRanges has to be called on the host, after the updated memory has been mapped.

Host-visible memory types that advertise the VK_MEMORY_PROPERTY_HOST_COHERENT_BIT property still require memory barriers between host and device in order to be coherent, but do not require additional cache management operations to achieve coherency. For host writes to be seen by subsequent command buffer operations, a pipeline barrier from a source of VK_ACCESS_HOST_WRITE_BIT and VK_PIPELINE_STAGE_HOST_BIT to a destination of the relevant device pipeline stages and access types must be performed. Note that such a barrier is performed implicitly upon each command buffer submission, so an explicit barrier is only rarely needed (e.g. if a command buffer waits upon an event signaled by the host, where the host wrote some data after submission). For device writes to be seen by subsequent host reads, a pipeline barrier is required to make the writes visible.

• I'm not sure about this part. Since the VK_MEMORY_PROPERTY_HOST_COHERENT_BIT flag isn't set, do I still need a pipeline barrier? (Or does vkFlushMappedMemoryRanges already take care of that?)
The result would be this:
(PseudoCode)
// Recorded once: the secondary command buffer binds the descriptor set.
VkCommandBuffer cmdSec = new SecondaryCommandBuffer;
int subPass = 0;
vkBeginCommandBuffer(cmdSec,VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT,renderPass,framebuffer,subPass);
vkCmdBindPipeline(cmdSec,pipeline);
vkCmdBindDescriptorSets(cmdSec,descSet);
foreach(mesh) {
    vkCmdBindVertexBuffers(cmdSec,...);
    vkCmdDraw(cmdSec,...);
}
vkEndCommandBuffer(cmdSec);

// Every frame: update the buffer behind the descriptor set, then submit.
vkMapMemory(descSetBufferMemory);
// Write data to mapped memory
vkFlushMappedMemoryRanges(descSetBufferMemory); // the range must still be mapped when it is flushed
vkUnmapMemory(descSetBufferMemory);
VkCommandBuffer cmdPrim = new PrimaryCommandBuffer;
vkBeginCommandBuffer(cmdPrim);
// Pipeline Barrier?
vkCmdBeginRenderPass(cmdPrim,renderPass,framebuffer,VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS);
vkCmdExecuteCommands(cmdPrim,cmdSec);
vkCmdEndRenderPass(cmdPrim);
vkEndCommandBuffer(cmdPrim);

Would that be correct so far?

Another thing I'm wondering about:
The memory of the buffer is updated and used by the pipeline every frame. What happens if a frame has been queued already, but not fully drawn, and I'm updating the buffer for the next frame already?
Would/Could that affect the queued frame? If so, could that be avoided with an additional barrier (source = VK_ACCESS_SHADER_READ_BIT, destination = VK_ACCESS_HOST_WRITE_BIT ("Wait for all shader reads to be completed before allowing the host to write"))?
Would it be better to use more than 1 buffer/descriptor set (+ more than 1 secondary command buffer), and swap between them each frame? If so, would 2 be enough (Even for mailbox present mode), or would I need as many as I have swapchain images?

I'd mostly just like to know if my general idea is correct, or if I'm missing/misinterpreting something.


That means my only(?) option is to use a descriptor set.
The idea is to bind the descriptor set inside the secondary command buffer recording, then update the descriptor set with the new data every frame, right before executing the secondary command buffer.

Just to get the terminology right - a descriptor set is a group of descriptors. A descriptor is a small structure that points to a resource.
You can either update a descriptor to point to a different resource, or just update the data within that existing resource.

Since the memory of the descriptor set's buffer changes every frame (=non-coherent) it has to be created without the VK_MEMORY_PROPERTY_HOST_COHERENT_BIT flag.

That's not what coherent/non-coherent means. Memory coherency means that two processors see the same version of events in memory. Coherency is an issue for multi-core CPU design too -- when one core writes to memory, that write might be stored in the core's cache for some time before actually reaching RAM. This means that other cores will see a non-coherent view of RAM. CPU manufacturers solve this by networking the cache of each CPU together, and following a coherency protocol, e.g. MESI.

By default, the CPU and GPU are not coherent because the CPU is accessing RAM via its cache, and the GPU is accessing it directly -- so the GPU won't see any values that are lingering in the CPU's cache.
Programmers can achieve coherency themselves, via functions like vkFlushMappedMemoryRanges/etc (internally this ensures that the CPU's writes have actually reached RAM, and informs the GPU to invalidate any caches that it may be using).
Or, if your hardware supports it, some PCs are capable of auto-magically establishing a coherent view of RAM. For example, these systems may be able to route the GPU's RAM read request to flow via the CPU's L2 cache, so that the latest values are picked up without the need for any flushing/invalidation commands. The downside is that this will be a longer route, so the latency will be increased -- so coherent memory heaps are ok for things like command buffers or some constant updates, but not so good for textures :)

In your case, you should be able to put your data in coherent or non-coherent heaps, as long as you follow the guidelines to achieve coherency yourself via Flush/etc...

As for the barrier -- vkFlushMappedMemoryRanges occurs on the CPU timeline and flushes the CPU cache out to RAM. The barrier occurs on the GPU timeline and invalidates any values that already exist in the GPU's cache, so that it will actually fetch fresh values from RAM - but as in your quote, this happens already for each command buffer submission.
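
One practical detail when flushing by hand on a non-coherent heap: the range passed to vkFlushMappedMemoryRanges has to respect VkPhysicalDeviceLimits::nonCoherentAtomSize (a limit that is not mentioned in this thread). Below is a minimal sketch of the rounding, assuming the whole allocation is currently mapped. FlushRange and alignFlushRange are hypothetical helper names, not part of the Vulkan API, and the code deliberately avoids the Vulkan headers so it compiles on its own:

```cpp
#include <cassert>
#include <cstdint>

// Byte range in the shape VkMappedMemoryRange expects (offset + size).
// Hypothetical helper type, not a Vulkan struct.
struct FlushRange {
    std::uint64_t offset;
    std::uint64_t size;
};

// Expands [offset, offset + size) so that it starts on a multiple of
// atomSize (nonCoherentAtomSize) and ends either on a multiple of atomSize
// or exactly at the end of the allocation.
inline FlushRange alignFlushRange(std::uint64_t offset, std::uint64_t size,
                                  std::uint64_t atomSize,
                                  std::uint64_t allocationSize) {
    assert(atomSize > 0);
    assert(offset + size <= allocationSize);
    // Round the start down to the previous atom boundary.
    const std::uint64_t begin = (offset / atomSize) * atomSize;
    // Round the end up to the next atom boundary.
    std::uint64_t end = ((offset + size + atomSize - 1) / atomSize) * atomSize;
    // A range is also allowed to stop exactly at the end of the allocation.
    if (end > allocationSize) {
        end = allocationSize;
    }
    return FlushRange{begin, end - begin};
}
```

The returned offset and size would go into VkMappedMemoryRange::offset and VkMappedMemoryRange::size before the flush. On a heap that advertises VK_MEMORY_PROPERTY_HOST_COHERENT_BIT the flush, and therefore this rounding, is not needed.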

Another thing I'm wondering about:
The memory of the buffer is updated and used by the pipeline every frame. What happens if a frame has been queued already, but not fully drawn, and I'm updating the buffer for the next frame already?
Would/Could that affect the queued frame? If so, could that be avoided with an additional barrier (source = VK_ACCESS_SHADER_READ_BIT, destination = VK_ACCESS_HOST_WRITE_BIT ("Wait for all shader reads to be completed before allowing the host to write"))?
Would it be better to use more than 1 buffer/descriptor set (+ more than 1 secondary command buffer), and swap between them each frame? If so, would 2 be enough (Even for mailbox present mode), or would I need as many as I have swapchain images?

Whenever the CPU is updating data that will be used by the GPU, you need to take care as the GPU is usually one frame behind the CPU. This usually means double or even triple-buffering your data. This is usually achieved by creating two (or more) resources and binding a different one each frame. This would also mean creating two descriptor sets, and two of your secondary command buffers...
You also need to use two (or more) fences to make sure that the CPU/GPU don't get too far ahead of each other. e.g. for double buffering, at the start of frame N, you must first wait on the fence that tells you that the GPU has finished frame N-2.
Once you've implemented this fencing scheme, you can use this one mechanism to ensure safe access to all of your per-frame resources.
e.g. once you know for a fact that the GPU is only ever 1 frame behind the CPU, then any resource that's more than 1 frame old is safe for the CPU to recycle/overwrite/reuse... and anything younger than that must be treated as if it's still being used by the GPU...
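
The fencing scheme above can be sketched as a small CPU-side model. This is not Vulkan code: FrameSlot, FrameRing and the gpuBusy flag are hypothetical stand-ins for the per-frame buffer, descriptor set, secondary command buffer and VkFence, and gpuFinished models the fence being signaled:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kFramesInFlight = 2;  // double buffering

// Hypothetical stand-in for one set of per-frame objects.
struct FrameSlot {
    bool gpuBusy = false;         // models a fence that has not been signaled yet
    std::uint64_t lastFrame = 0;  // frame number last submitted from this slot
};

struct FrameRing {
    std::array<FrameSlot, kFramesInFlight> slots{};

    // Frame N reuses the slot that frame N - kFramesInFlight used.
    static std::uint32_t slotFor(std::uint64_t frame) {
        return static_cast<std::uint32_t>(frame % kFramesInFlight);
    }

    // True if the CPU would have to block on the slot's fence before
    // it may overwrite the slot's buffer for this frame.
    bool mustWait(std::uint64_t frame) const {
        return slots[slotFor(frame)].gpuBusy;
    }

    // CPU timeline: write the slot's data and submit the frame.
    void submit(std::uint64_t frame) {
        FrameSlot& s = slots[slotFor(frame)];
        assert(!s.gpuBusy && "wait on the fence before overwriting this slot");
        s.gpuBusy = true;
        s.lastFrame = frame;
    }

    // GPU timeline: the fence for this frame is signaled.
    void gpuFinished(std::uint64_t frame) {
        FrameSlot& s = slots[slotFor(frame)];
        assert(s.gpuBusy && s.lastFrame == frame);
        s.gpuBusy = false;
    }
};
```

With two slots, frames 0 and 1 can be submitted back to back, but frame 2 has to wait until frame 0 (that is, frame N-2) has finished. In a real renderer the wait would be a vkWaitForFences call on the fence that was passed to vkQueueSubmit for that slot.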

So, if you want to edit your descriptor set, or edit the resources that it points to... you're not allowed to until the GPU has finished consuming them. You can solve this by double buffering as above -- two resources, so you can have two sets of values in flight... which means two descriptor sets in flight... which means pre-creating two versions of your command buffer :(

Alternatively, you can use a single descriptor set (not double-buffered, never updated) and a single resource (not double-buffered, but updated on the GPU timeline instead of the CPU timeline) :)
If these updates occur on the GPU timeline, then there's no need to double buffer the resource, which means there's no need for multiple descriptor sets.
However, this also introduces its own pitfalls... To perform this update on the GPU timeline, you now need the "main" version of the resource, which is referenced by the descriptor set and read by your shaders. You also need a double-buffered "upload" resource, which is written to by the CPU each frame. You then submit a command buffer that instructs the GPU to copy from (one of) the upload resources to the "main" resource.
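
To make the ordering argument concrete, here is a rough CPU-side model of the GPU-timeline update. It is only a sketch: GpuTimelineModel is a hypothetical name, a single float stands in for the buffer contents, and the lambdas stand in for a recorded copy command and a draw.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical model: commands are queued by the CPU and executed later,
// strictly in submission order, like work on a single GPU queue.
struct GpuTimelineModel {
    float mainValue = 0.0f;                   // the single "main" resource read by shaders
    std::array<float, 2> upload{};            // double-buffered "upload" resource
    std::deque<std::function<void()>> queue;  // submitted but not yet executed
    std::vector<float> drawn;                 // value each draw actually read

    // CPU timeline: write this frame's data, then queue a copy and a draw.
    void recordFrame(std::size_t frame, float value) {
        const std::size_t slot = frame % upload.size();
        upload[slot] = value;  // host write into this frame's upload slot
        // Stands in for a copy command from the upload slot to the main resource.
        queue.push_back([this, slot] { mainValue = upload[slot]; });
        // Stands in for the draw that reads the main resource.
        queue.push_back([this] { drawn.push_back(mainValue); });
    }

    // GPU timeline: run everything submitted so far, in order.
    void executeAll() {
        while (!queue.empty()) {
            queue.front()();
            queue.pop_front();
        }
    }
};
```

Even if the CPU records frame 1 before the GPU has started frame 0, each draw still reads the value that was copied for its own frame, because the copy and the draw sit on the same ordered timeline. In real Vulkan the copy would be something like vkCmdCopyBuffer recorded outside the render pass, followed by a pipeline barrier before the shader reads, and the upload slots would still need the fence-based protection described above.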

Edited by Hodgman
