# Vulkan indirect drawing and dynamic uniform buffer

## Recommended Posts

Hello.
I'm trying to implement indirect drawing with Vulkan.
Everything works, but now I would like to add some per-object state variables to send when drawing.
For example: if I have 3 cubes and I want to change the color of only the first one, how can I do that?
Are dynamic uniform buffers made for this?
I read that it's possible to change just one state variable, but how?
I need to understand how to use dynamic uniform buffers to change the state of one or more objects without re-mapping the whole data buffer.

Thanks.

##### Share on other sites

Hello.
I partly understand dynamic buffers now.
In my program I have a dynamic buffer holding one matrix per object, but how can I copy only some of the matrices to the GPU instead of the entire buffer?
Here is an example from the Vulkan SDK:

    void updateDynamicUniformBuffer(bool force = false)
    {
        // Update at max. 60 fps
        animationTimer += frameTimer;
        if ((animationTimer <= 1.0f / 60.0f) && (!force)) {
            return;
        }

        // Dynamic UBO with per-object model matrices indexed by offsets in the command buffer
        uint32_t dim = static_cast<uint32_t>(pow(OBJECT_INSTANCES, (1.0f / 3.0f)));
        glm::vec3 offset(5.0f);

        for (uint32_t x = 0; x < dim; x++)
        {
            for (uint32_t y = 0; y < dim; y++)
            {
                for (uint32_t z = 0; z < dim; z++)
                {
                    uint32_t index = x * dim * dim + y * dim + z;

                    // Aligned offset into the mapped dynamic buffer
                    glm::mat4* modelMat = (glm::mat4*)(((uint64_t)uboDataDynamic.model + (index * dynamicAlignment)));

                    // Update rotations
                    rotations[index] += animationTimer * rotationSpeeds[index];

                    // Update matrices
                    glm::vec3 pos = glm::vec3(
                        -((dim * offset.x) / 2.0f) + offset.x / 2.0f + x * offset.x,
                        -((dim * offset.y) / 2.0f) + offset.y / 2.0f + y * offset.y,
                        -((dim * offset.z) / 2.0f) + offset.z / 2.0f + z * offset.z);
                    *modelMat = glm::translate(glm::mat4(), pos);
                    *modelMat = glm::rotate(*modelMat, rotations[index].x, glm::vec3(1.0f, 1.0f, 0.0f));
                    *modelMat = glm::rotate(*modelMat, rotations[index].y, glm::vec3(0.0f, 1.0f, 0.0f));
                    *modelMat = glm::rotate(*modelMat, rotations[index].z, glm::vec3(0.0f, 0.0f, 1.0f));
                }
            }
        }

        animationTimer = 0.0f;

        isFilledDynamicBuffer = true;

        // Flush to make host writes visible to the device
        VkMappedMemoryRange memoryRange = vkTools::initializers::mappedMemoryRange();
        memoryRange.memory = uniformBuffers.dynamic.memory;
        memoryRange.size = uniformBuffers.dynamic.size;
        vkFlushMappedMemoryRanges(device, 1, &memoryRange);
    }


What can I use instead of memcpy(uniformBuffers.dynamic.mapped, uboDataDynamic.model, uniformBuffers.dynamic.size); to update only some of the matrices instead of the entire buffer?
Thanks.
2) Can I use dynamic buffers with Vulkan indirect draw?
thanks.

##### Share on other sites

How can I copy only some matrices to the GPU instead of the entire buffer?

Of course you can change only some matrices in the mapped memory, but you still need to upload the whole range between them (or use multiple uploads to exclude large memory regions where no changes occur).
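To make the partial update concrete, here is a minimal sketch of computing the legal flush range for a single object's matrix. The helper names are hypothetical; `dynamicAlignment` is the per-object stride from the SDK example above, and Vulkan requires the flushed `offset` to be a multiple of `VkPhysicalDeviceLimits::nonCoherentAtomSize` (and `size` too, unless it reaches the end of the allocation):

```cpp
#include <cstdint>

// Byte range to put into a VkMappedMemoryRange (offset/size must be
// multiples of VkPhysicalDeviceLimits::nonCoherentAtomSize).
struct FlushRange { uint64_t offset; uint64_t size; };

// Round x down / up to a multiple of 'align' (align must be a power of two).
static uint64_t AlignDown(uint64_t x, uint64_t align) { return x & ~(align - 1); }
static uint64_t AlignUp(uint64_t x, uint64_t align) { return (x + align - 1) & ~(align - 1); }

// Compute the smallest legal flush range covering one object's matrix.
// After memcpy'ing the single matrix to mapped + index * dynamicAlignment,
// flush only this range instead of the whole buffer.
FlushRange ComputeFlushRange(uint64_t index, uint64_t dynamicAlignment,
                             uint64_t matrixSize, uint64_t nonCoherentAtomSize)
{
    uint64_t begin = index * dynamicAlignment;  // start of this object's slot
    uint64_t end = begin + matrixSize;          // end of the written bytes
    uint64_t offset = AlignDown(begin, nonCoherentAtomSize);
    return { offset, AlignUp(end, nonCoherentAtomSize) - offset };
}
```

You would fill `VkMappedMemoryRange::offset`/`size` from this result and call `vkFlushMappedMemoryRanges` (batching several ranges in one call is fine). If the memory type is `HOST_COHERENT`, no flush is needed at all.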

An alternative would be to upload only the changed matrices plus an array identifying their target indices in a small separate buffer, and use a compute shader to copy the matrices to their correct locations on the GPU.

For comparison: I once rendered 2 million cubes, each with its own transformation matrix (FPS was around 60, IIRC).

Each vertex had an integer index to its matrix in the w component of its position; the matrices were simply stored in a single large buffer. (There was no performance difference between using a uniform or a storage buffer for this.)

This way I could draw all cubes with a single draw call, similar to a skinned mesh. I have not tested whether this is faster than using 2 million draws and identifying the proper matrix per object.

I assume it depends on the vertex count per object, and I would expect these differences:

Per draw: the matrix can be copied to some fast constant read-only memory before issuing the draw.

Per vertex: each vertex needs to fetch its own matrix, probably from cache.

But I don't really know. Maybe some graphics guru can share some insights about this... :)

##### Share on other sites

use a compute shader to copy the matrices to their correct location on GPU.

In practice (as opposed to the tests I'm doing now), the real application needs picking that changes the color of the selected object, a visible/invisible boolean, and a color that may occasionally change during the indirect draw.
Picking is the most important part: can you advise me on how to do it with a compute shader? Are there architectural or simple examples for picking and bounding boxes?
thanks

##### Share on other sites

So you have a number of unordered objects on screen and the user picks one with a mouse cursor or crosshair?

The fastest way would be raytracing (e.g. for a crosshair, the pick ray is simply the front vector of the camera transform).

If you have really many objects you might want to use a hierarchy (an octree or whatever) to speed things up, but for a single ray this should not be necessary.

A typical ray vs. axis-aligned bounding box test looks like this:

    bool TestRayAABox (const Ray& ray, const AABox& box)
    {
        // returns false if ray is parallel with a box face
        qVec3 t0 = qVec3(box.minmax[0] - ray.origin).MulPerElem (ray.invDirection);
        qVec3 t1 = qVec3(box.minmax[1] - ray.origin).MulPerElem (ray.invDirection);
        qVec3 tMin = t0.MinPerElem (t1);
        qVec3 tMax = t0.MaxPerElem (t1);
        qScalar ffd = tMin.MaxElem(); // front face distance (behind origin if inside box)
        qScalar bfd = tMax.MinElem(); // back face distance

        return (ffd <= bfd) & (bfd >= 0.0f) & (ffd <= ray.length);
    }

    bool IntersectRayAABox (const Ray& ray, const AABox& box, float& t)
    {
        qScalar ffd, bfd;
        DistanceRayAABoxFrontAndBackface (ffd, bfd, ray, box);
        //RenderPoint (3, qVec3(ray.origin + ray.direction * ffd), 1,0,0);
        //RenderPoint (3, qVec3(ray.origin + ray.direction * bfd), 0,1,0);

        // always returns the first intersection with a face;
        // the point of intersection is ray.origin + ray.direction * t
        t = (ffd > 0) ? ffd : bfd;
        return (ffd <= bfd) & (bfd >= 0.0f) & (ffd <= ray.length);
    }



box.minmax[0] is the minimum 3D box position and [1] is the maximum.

ray.invDirection is 1.0 / (ray unit direction), per component.

This is a CPU SIMD implementation; all those qVec3 member functions should be self-explanatory and are also available for GLSL vectors.

You would just iterate over all boxes and keep the closest hit.
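For illustration, here is a scalar C++ sketch of the same slab test plus that closest-hit loop; the `Ray`/`AABox` structs are minimal stand-ins for the types used above, not an existing library:

```cpp
#include <cfloat>
#include <algorithm>
#include <vector>

struct Vec3 { float x, y, z; };
struct Ray  { Vec3 origin, invDirection; float length; }; // invDirection = 1 / unit direction
struct AABox { Vec3 minmax[2]; };                         // [0] = min corner, [1] = max corner

// Slab test: returns true on hit and writes the distance of the first
// face intersection to t (point of intersection = origin + direction * t).
bool IntersectRayAABox(const Ray& ray, const AABox& box, float& t)
{
    float o[3]   = { ray.origin.x, ray.origin.y, ray.origin.z };
    float inv[3] = { ray.invDirection.x, ray.invDirection.y, ray.invDirection.z };
    float mn[3]  = { box.minmax[0].x, box.minmax[0].y, box.minmax[0].z };
    float mx[3]  = { box.minmax[1].x, box.minmax[1].y, box.minmax[1].z };

    float tNear = -FLT_MAX, tFar = FLT_MAX;
    for (int i = 0; i < 3; i++) {
        float t0 = (mn[i] - o[i]) * inv[i];
        float t1 = (mx[i] - o[i]) * inv[i];
        tNear = std::max(tNear, std::min(t0, t1)); // front face distance so far
        tFar  = std::min(tFar,  std::max(t0, t1)); // back face distance so far
    }
    t = (tNear > 0.0f) ? tNear : tFar;
    return (tNear <= tFar) && (tFar >= 0.0f) && (tNear <= ray.length);
}

// Iterate over all boxes and keep the closest hit; returns -1 if nothing was picked.
int PickClosest(const Ray& ray, const std::vector<AABox>& boxes)
{
    int best = -1;
    float bestT = FLT_MAX, t;
    for (int i = 0; i < (int)boxes.size(); i++)
        if (IntersectRayAABox(ray, boxes[i], t) && t < bestT) { bestT = t; best = i; }
    return best;
}
```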

You can do this with compute shaders, but for only a single ray, doing it on the CPU makes more sense even with 1000 boxes. The GPU is good for thousands of rays, or a huge number of boxes (100000?) in parallel.

If you're looking for an introduction to compute shaders, I recommend the chapter from the OpenGL SuperBible.

A few pages there are enough to explain not only the tech but, more importantly, the real catch and power of parallel programming. (But be warned: they have some barriers in the wrong order. memoryBarrierShared() followed by barrier() is correct.)

Vulkan changed only what was necessary on the CPU side; the shaders themselves stay the same. Sascha Willems' code on GitHub has Vulkan examples for this.

##### Share on other sites
Thank you, you are very kind.
Just two questions before I start reading:
1) I have 4 lists of geometric objects, with 100000 objects in the worst case, and creating a scene graph would be a huge waste of memory; could you show me an example or tutorial for doing these things without a scene graph, if possible?
2) I read something about compute shaders. In some cases I'd be fine with sending a raw 64-bit address per object to quickly find the corresponding C++ list object, but in the end this is only an index, so can I use an ordered list instead?
Thanks, now I'll start reading the SuperBible.
Hello.

use a compute shader to copy the matrices to their correct location on GPU.

What does this mean? I don't understand.

##### Share on other sites

1) I have 4 lists of geometric objects, with 100000 objects in the worst case, and creating a scene graph would be a huge waste of memory; could you show me an example or tutorial for doing these things without a scene graph, if possible?

I assume you want to check the bounding boxes of all 100000 objects against a single picking ray.

You can brute-force this on the GPU: say the GPU has 2000 threads, then each one would only do 50 tests.

This works well, but only if you already have (and animate) the box data on the GPU. If you need to upload the whole box data from system memory each frame, the upload would take longer than the tests.

I guess it's still doable depending on your needs, but if only a few objects move per frame you should indeed upload only the changes:

use a compute shader to copy the matrices to their correct location on GPU.

E.g. you have 5 objects on the GPU; we denote each by its array index and its data as a letter:

(0:a),(1:b),(2:c),(3:d),(4:e)

Now you decide on CPU that object 0 and 3 have to change their letter.

You would upload this compact data:

(0:x),(3:y)

Then you run a compute shader that copies the uploaded data to the original locations, effectively updating the data, so you have:

(0:x),(1:b),(2:c),(3:y),(4:e)

Easy, but that's what I meant. It limits the cost of the expensive upload operation.
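A CPU model of what that scatter pass does (on the GPU, each compute shader invocation would handle one entry of the compact upload; the names here are hypothetical):

```cpp
#include <cstdint>
#include <vector>

// One entry of the compact upload: target index + new payload.
// (Payload is a char here to mirror the letters above; in practice
// it would be a mat4 or whatever per-object state you update.)
struct UpdateEntry { uint32_t index; char data; };

// CPU model of the scatter compute shader: each GPU thread would run
// one iteration of this loop, copying its entry to the original location.
void ScatterUpdates(std::vector<char>& gpuData, const std::vector<UpdateEntry>& uploads)
{
    for (const UpdateEntry& u : uploads)   // one compute invocation per entry
        gpuData[u.index] = u.data;
}
```

In GLSL this becomes a compute shader with both buffers bound as storage buffers, each invocation using gl_GlobalInvocationID.x as its entry index.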

2) I read something about compute shaders. In some cases I'd be fine with sending a raw 64-bit address per object to quickly find the corresponding C++ list object, but in the end this is only an index, so can I use an ordered list instead?

No idea what you mean here; language problems :)

But one thing: it's unlikely you'll keep the same memory layout for your data on the GPU.

E.g. you have on CPU:

    struct Object
    {
        vec3 boxMin, boxMax;
        mat4x4 transform;
    } objects[100000];

You likely end up on GPU with something like:

    vec3 boxMin[100000];
    vec3 boxMax[100000];
    vec3 pos[100000];
    vec4 quat[100000]; // a quaternion instead of a matrix can save bandwidth and GPU registers

... so more the SoA (structure of arrays) than the AoS (array of structures) way.

This is because all GPU threads likely read boxMin at the same time, so having those values close together is better, even if the next instruction reads boxMax.
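As a hypothetical sketch (struct and function names are my own, not from any library), deinterleaving the CPU-side AoS data into the GPU-side SoA layout before upload could look like this:

```cpp
#include <vector>

struct Vec3 { float x, y, z; };

// CPU-side AoS layout, as in the struct above (transform omitted for brevity).
struct Object { Vec3 boxMin, boxMax; };

// GPU-side SoA layout: one tightly packed array per attribute, so
// parallel threads reading the same attribute touch adjacent memory.
struct GpuArrays { std::vector<Vec3> boxMin, boxMax; };

GpuArrays ToSoA(const std::vector<Object>& objects)
{
    GpuArrays out;
    out.boxMin.reserve(objects.size());
    out.boxMax.reserve(objects.size());
    for (const Object& o : objects) {  // deinterleave each attribute
        out.boxMin.push_back(o.boxMin);
        out.boxMax.push_back(o.boxMax);
    }
    return out;
}
```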
