Vulkan indirect drawing and dynamic uniform buffer

hello.
I'm trying to implement indirect drawing in Vulkan.
Everything works, but now I would like to send some per-object state variables with each draw.
For example: if I have 3 cubes and I want to change the color of only the first one, how can I do that?
Are dynamic uniform buffers made for this?
I read that I can change a single state variable, but how do I do it?
I need to understand how to use dynamic uniform buffers to change the state of only one or a few objects without re-uploading the whole data buffer.
 
thanks.


hello.
I now partly understand dynamic buffers.
In my program I have a dynamic buffer with one matrix per object, but how can I copy only some of the matrices to the GPU instead of the whole buffer?
Here is an example from the Vulkan SDK samples:
 

void updateDynamicUniformBuffer(bool force = false)
{
    // Update at max. 60 fps
    animationTimer += frameTimer;
    if ((animationTimer <= 1.0f / 60.0f) && (!force)) {
        return;
    }

    // Dynamic ubo with per-object model matrices indexed by offsets in the command buffer
    uint32_t dim = static_cast<uint32_t>(pow(OBJECT_INSTANCES, (1.0f / 3.0f)));
    glm::vec3 offset(5.0f);

    for (uint32_t x = 0; x < dim; x++)
    {
        for (uint32_t y = 0; y < dim; y++)
        {
            for (uint32_t z = 0; z < dim; z++)
            {
                uint32_t index = x * dim * dim + y * dim + z;

                // Aligned offset into the host-side copy of the dynamic ubo
                glm::mat4* modelMat = (glm::mat4*)(((uint64_t)uboDataDynamic.model + (index * dynamicAlignment)));

                // Update rotations
                rotations[index] += animationTimer * rotationSpeeds[index];

                // Update matrices
                glm::vec3 pos = glm::vec3(
                    -((dim * offset.x) / 2.0f) + offset.x / 2.0f + x * offset.x,
                    -((dim * offset.y) / 2.0f) + offset.y / 2.0f + y * offset.y,
                    -((dim * offset.z) / 2.0f) + offset.z / 2.0f + z * offset.z);
                *modelMat = glm::translate(glm::mat4(1.0f), pos); // explicit identity (glm::mat4() is uninitialized in recent GLM)
                *modelMat = glm::rotate(*modelMat, rotations[index].x, glm::vec3(1.0f, 1.0f, 0.0f));
                *modelMat = glm::rotate(*modelMat, rotations[index].y, glm::vec3(0.0f, 1.0f, 0.0f));
                *modelMat = glm::rotate(*modelMat, rotations[index].z, glm::vec3(0.0f, 0.0f, 1.0f));
            }
        }
    }

    animationTimer = 0.0f;

    memcpy(uniformBuffers.dynamic.mapped, uboDataDynamic.model, uniformBuffers.dynamic.size);
    isFilledDynamicBuffer = true;

    // Flush to make the host writes visible to the device
    VkMappedMemoryRange memoryRange = vkTools::initializers::mappedMemoryRange();
    memoryRange.memory = uniformBuffers.dynamic.memory;
    memoryRange.size = uniformBuffers.dynamic.size;

    vkFlushMappedMemoryRanges(device, 1, &memoryRange);
}

What can I use instead of memcpy(uniformBuffers.dynamic.mapped, uboDataDynamic.model, uniformBuffers.dynamic.size); to update only some of the matrices instead of the entire buffer?
Thanks.
2) Can I use dynamic uniform buffers with Vulkan indirect draw?
thanks.


how can I copy only some of the matrices to the GPU instead of the whole buffer?

 

Of course you can change only some matrices in the mapped memory, but you still have to upload (flush) the whole range spanning them, or use multiple uploads to skip large memory regions where no changes occur.
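To make that concrete: if only one object changed, you can memcpy just that object's slot into the mapped pointer and flush only that sub-range. The catch is that VkMappedMemoryRange offset and size must be multiples of VkPhysicalDeviceLimits::nonCoherentAtomSize (or reach the end of the allocation), so the dirty range has to be widened first. A minimal sketch of the alignment helper; the usage comment reuses names from the sample above (device, dynamicAlignment, uniformBuffers etc. are assumed from your setup):

```cpp
#include <cstdint>
#include <cstring>

// Flush range aligned to the device's nonCoherentAtomSize.
struct FlushRange { uint64_t offset; uint64_t size; };

// Widen [dirtyOffset, dirtyOffset + dirtySize) to atomSize alignment,
// clamped to the buffer end (offset + size == whole size is also valid).
FlushRange alignFlushRange(uint64_t dirtyOffset, uint64_t dirtySize,
                           uint64_t atomSize, uint64_t bufferSize)
{
    uint64_t begin = (dirtyOffset / atomSize) * atomSize;       // round down
    uint64_t end   = dirtyOffset + dirtySize;
    end = ((end + atomSize - 1) / atomSize) * atomSize;         // round up
    if (end > bufferSize) end = bufferSize;                     // clamp to allocation
    return { begin, end - begin };
}

/* Usage with the Vulkan objects from the sample (sketch, not verified code):

    uint64_t dirtyOffset = index * dynamicAlignment;
    memcpy((char*)uniformBuffers.dynamic.mapped + dirtyOffset,
           (char*)uboDataDynamic.model + dirtyOffset,
           sizeof(glm::mat4));

    FlushRange r = alignFlushRange(dirtyOffset, sizeof(glm::mat4),
                                   limits.nonCoherentAtomSize,
                                   uniformBuffers.dynamic.size);
    VkMappedMemoryRange memoryRange = vkTools::initializers::mappedMemoryRange();
    memoryRange.memory = uniformBuffers.dynamic.memory;
    memoryRange.offset = r.offset;
    memoryRange.size   = r.size;
    vkFlushMappedMemoryRanges(device, 1, &memoryRange);
*/
```

If the memory type is HOST_COHERENT, no flush is needed at all and the partial memcpy alone is enough.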

 

An alternative would be to upload only the changed matrices plus an array identifying their target indices in a small separate buffer, and use a compute shader to copy the matrices to their correct locations on the GPU.

 

 

Not knowing much about UBOs, I share your questions, but I made this related experiment:

 

Render 2 million cubes, each with its own transformation matrix (FPS around 60 IIRC).

Each vertex had an integer index to its matrix in the w component of its position; the matrices were simply stored in a single large buffer. (There was no performance difference between using a Uniform or Storage Buffer for this.)

 

This way I can draw all cubes with one draw, similar to a skinned mesh. I have not tested whether this is faster than using 2 million draws and identifying the proper matrix per object.

I assume it depends on the vertex count per object, and I expect these differences:

Per draw: the matrix can be copied to some fast constant read-only memory before issuing the draw.

Per vertex: each vertex needs to fetch its own matrix, probably from cache.

 

But I do not really know. Maybe some graphics guru can share some insights about this... :)


use a compute shader to copy the matrices to their correct location on GPU.

In practice, beyond the tests I'm doing, the real application needs picking that changes the color of the selected object, a visible/invisible flag, and a color that may change from time to time, all with indirect draw.
For the picking, which is the most important part, can you advise me how to do it with a compute shader? Are there architectural or simple examples for picking and bounding boxes?
thanks


So you have a number of unordered objects on screen and the user picks one with a mouse cursor or crosshair?

 

The fastest way would be raytracing (e.g. for a crosshair the ray would simply be the forward vector of the camera transform).

If you have really many objects you might wanna use a hierarchy, octree or whatever to speed things up, but for a single ray this should not be necessary.

A typical ray - axis aligned bounding box test looks like this:

 

    bool TestRayAABox (const Ray& ray, const AABox& box)
    {
        // slab test (the per-element min/max handles rays parallel to a box face)
        qVec3 t0 = qVec3(box.minmax[0] - ray.origin).MulPerElem (ray.invDirection);
        qVec3 t1 = qVec3(box.minmax[1] - ray.origin).MulPerElem (ray.invDirection);
        qVec3 tMin = t0.MinPerElem (t1);
        qVec3 tMax = t0.MaxPerElem (t1);
        qScalar ffd = tMin.MaxElem(); // front face distance (behind origin if inside box)
        qScalar bfd = tMax.MinElem(); // back face distance

        return (ffd <= bfd) & (bfd >= 0.0f) & (ffd <= ray.length);
    }

    bool IntersectRayAABox (const Ray& ray, const AABox& box, float& t)
    {
        qScalar ffd, bfd;
        // computes ffd / bfd exactly as in TestRayAABox above
        DistanceRayAABoxFrontAndBackface (ffd, bfd, ray, box);
        //RenderPoint (3, qVec3(ray.origin + ray.direction * ffd), 1,0,0);
        //RenderPoint (3, qVec3(ray.origin + ray.direction * bfd), 0,1,0);

        // always returns the first intersection with a face;
        // the point of intersection is ray.origin + ray.direction * t
        t = (ffd > 0) ? ffd : bfd;
        return (ffd <= bfd) & (bfd >= 0.0f) & (ffd <= ray.length);
    }
 

 

box.minmax[0] is the minimum 3D box position and [1] is the maximum.

ray.invDirection is 1.0 / (ray unit direction), per component.

This is a CPU SIMD implementation; all those member functions of qVec3 should be self-explanatory and are also available for GLSL vectors.

You would just iterate over all boxes and keep the closest hit.
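Putting the test and the loop together, here is a self-contained CPU sketch of the brute-force pick, with plain-float stand-ins for the qVec3/qScalar types above (the struct and function names here are illustrative, not from any library):

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Plain-float stand-ins for the SIMD types used in the post above.
struct Vec3 { float x, y, z; };
struct Ray  { Vec3 origin, invDirection; float length; }; // invDirection = 1 / unit direction
struct AABox { Vec3 minmax[2]; };                          // [0] = min corner, [1] = max corner

// Slab test; on a hit, t receives the distance to the first intersected face.
static bool intersectRayAABox(const Ray& ray, const AABox& box, float& t)
{
    float tMin = -std::numeric_limits<float>::infinity();
    float tMax =  std::numeric_limits<float>::infinity();
    const float o[3]  = { ray.origin.x, ray.origin.y, ray.origin.z };
    const float id[3] = { ray.invDirection.x, ray.invDirection.y, ray.invDirection.z };
    const float mn[3] = { box.minmax[0].x, box.minmax[0].y, box.minmax[0].z };
    const float mx[3] = { box.minmax[1].x, box.minmax[1].y, box.minmax[1].z };
    for (int i = 0; i < 3; i++) {
        float t0 = (mn[i] - o[i]) * id[i];
        float t1 = (mx[i] - o[i]) * id[i];
        tMin = std::fmax(tMin, std::fmin(t0, t1));   // front face distance
        tMax = std::fmin(tMax, std::fmax(t0, t1));   // back face distance
    }
    if (!(tMin <= tMax && tMax >= 0.0f && tMin <= ray.length)) return false;
    t = (tMin > 0.0f) ? tMin : tMax;                 // first face hit (inside box -> back face)
    return true;
}

// Brute-force pick: test every box and keep the closest hit (or -1 for a miss).
static int pickClosest(const Ray& ray, const std::vector<AABox>& boxes, float& tBest)
{
    int best = -1;
    tBest = std::numeric_limits<float>::infinity();
    for (size_t i = 0; i < boxes.size(); i++) {
        float t;
        if (intersectRayAABox(ray, boxes[i], t) && t < tBest) { tBest = t; best = (int)i; }
    }
    return best;
}
```

For picking you usually also want the object index of the closest hit, which is why pickClosest returns the index rather than just a bool.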

 

You can do this with compute shaders, but for only a single ray doing it on CPU would make more sense even with 1000 boxes. GPU is good for thousands of rays, or a huge number of boxes (100000?) in parallel.

 

If you're looking for an introduction to compute shaders, I recommend the chapter from the OpenGL Super Bible.

A few pages there are enough not only to explain the tech but, more importantly, the real catch and power of parallel programming. (But be warned: they have some barriers in the wrong order. memoryBarrierShared() followed by barrier() is correct.)

Vulkan changed only the things necessary on the CPU side; the shaders themselves stay the same. Sascha Willems' code on GitHub has Vulkan examples for this.

Thank you, you are very kind.
Just 2 questions before I start reading:
1) I have 4 lists of geometric objects, 100000 objects in the worst case, and creating a scene graph would be a huge waste of memory; can you point me to an example or tutorial for doing these things without a scene graph, if possible?
2) I read something about compute shaders, but in some cases I'd be fine if I could send a raw 64-bit address per object to quickly find the corresponding list object (a C++ object); in the end this is only an index, so can I use an ordered list?
Thanks, now I'll start reading the Super Bible.
hello.

use a compute shader to copy the matrices to their correct location on GPU.

What does this mean? I don't understand. Edited by giugio1977


1) I have 4 lists of geometric objects, 100000 objects in the worst case, and creating a scene graph would be a huge waste of memory; can you point me to an example or tutorial for doing these things without a scene graph, if possible?

 

I assume you want to check the bounding boxes of all 100000 objects against a single picking ray.

You can brute force this on the GPU. Say the GPU has 2000 threads; then each one would only do 50 tests.

 

So this works well, but only if you already have (and animate) the box data on the GPU. If you need to upload the whole box data from system memory each frame, the upload would take longer than the tests.

I guess it's still doable depending on your needs, but if only a few objects move per frame you should indeed upload only the changes:

 

use a compute shader to copy the matrices to their correct location on GPU.

 

E.g. you have 5 objects on the GPU, writing each as its array index plus its data as a letter:

(0:a),(1:b),(2:c),(3:d),(4:e)

 

Now you decide on CPU that object 0 and 3 have to change their letter.

You would upload this compact data:

(0:x),(3:y)

 

Then you run a compute shader that copies the uploaded data to the original locations, effectively updating the data, so you have:

(0:x),(1:b),(2:c),(3:y),(4:e)

 

Easy, but that's what I mean. It limits the cost of the expensive upload operation.
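The scatter the compute shader performs can be sketched on the CPU like this (in GLSL, each loop iteration would be one invocation, indexed by gl_GlobalInvocationID.x; the names here are illustrative):

```cpp
#include <cstdint>
#include <vector>

// Compact upload: (target index, new value) pairs - only the changed objects.
struct Update { uint32_t index; char value; };

// CPU sketch of the GPU scatter-copy: each "thread" reads one pair from the
// small uploaded buffer and writes the value to its original location in the
// big per-object buffer.
void scatterUpdates(std::vector<char>& gpuData, const std::vector<Update>& uploads)
{
    for (const Update& u : uploads)   // one compute invocation per iteration
        gpuData[u.index] = u.value;
}
```

With gpuData = {a,b,c,d,e} and uploads (0:x),(3:y) this yields {x,b,c,y,e}, exactly the example above; in the real thing the values would be matrices instead of letters.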

 

 

2) I read something about compute shaders, but in some cases I'd be fine if I could send a raw 64-bit address per object to quickly find the corresponding list object (a C++ object); in the end this is only an index, so can I use an ordered list?

 

No idea what you mean here - language problems :)

But one thing: it's unlikely you'll keep the same memory layout for your data on the GPU.

 

E.g. you have on CPU:

 

struct object
{
    vec3 boxMin, boxMax;
    mat4x4 transform;
} objects[100000];

 

You likely end up on GPU with something like:

 

vec3 boxMin[100000];
vec3 boxMax[100000];
vec3 pos[100000];
vec4 quat[100000]; // a quaternion instead of a matrix can save bandwidth and GPU registers

 

... so more the SOA (structure of arrays) than the AOS (array of structures) way.

This is because all GPU threads will likely read boxMin at the same time, so having those values close together is better, even if the next instruction reads boxMax.

