giugio1977

Vulkan indirect drawing and dynamic uniform buffer


Hello.
I'm trying to set up indirect drawing with Vulkan.
Everything works fine, but now I'd like to send some per-object state along with each draw.
For example: if I have 3 cubes and want to change the color of only the first one, how can I do that?
Are dynamic uniform buffers made for this?
I read that it's possible to change just a single state variable; how do I do that?
I need to understand how to use dynamic uniform buffers to change the state of only one or a few objects without re-uploading the whole data buffer.

Thanks.


Hello.
I now partly understand dynamic buffers.
In my program I have a dynamic buffer with one matrix per object, but how can I copy only some of the matrices to the GPU instead of copying the entire buffer?
Here is an example from the Vulkan SDK:
 

void updateDynamicUniformBuffer(bool force = false)
{
    // Update at max. 60 fps
    animationTimer += frameTimer;
    if ((animationTimer <= 1.0f / 60.0f) && (!force)) {
        return;
    }

    // Dynamic ubo with per-object model matrices indexed by offsets in the command buffer
    uint32_t dim = static_cast<uint32_t>(pow(OBJECT_INSTANCES, (1.0f / 3.0f)));
    glm::vec3 offset(5.0f);

    for (uint32_t x = 0; x < dim; x++)
    {
        for (uint32_t y = 0; y < dim; y++)
        {
            for (uint32_t z = 0; z < dim; z++)
            {
                uint32_t index = x * dim * dim + y * dim + z;

                // Aligned offset of this object's matrix inside the dynamic ubo
                glm::mat4* modelMat = (glm::mat4*)(((uint64_t)uboDataDynamic.model + (index * dynamicAlignment)));

                // Update rotations
                rotations[index] += animationTimer * rotationSpeeds[index];

                // Update matrices
                glm::vec3 pos = glm::vec3(
                    -((dim * offset.x) / 2.0f) + offset.x / 2.0f + x * offset.x,
                    -((dim * offset.y) / 2.0f) + offset.y / 2.0f + y * offset.y,
                    -((dim * offset.z) / 2.0f) + offset.z / 2.0f + z * offset.z);
                *modelMat = glm::translate(glm::mat4(), pos);
                *modelMat = glm::rotate(*modelMat, rotations[index].x, glm::vec3(1.0f, 1.0f, 0.0f));
                *modelMat = glm::rotate(*modelMat, rotations[index].y, glm::vec3(0.0f, 1.0f, 0.0f));
                *modelMat = glm::rotate(*modelMat, rotations[index].z, glm::vec3(0.0f, 0.0f, 1.0f));
            }
        }
    }

    animationTimer = 0.0f;

    memcpy(uniformBuffers.dynamic.mapped, uboDataDynamic.model, uniformBuffers.dynamic.size);
    isFilledDynamicBuffer = true;

    // Flush to make host writes visible to the device
    VkMappedMemoryRange memoryRange = vkTools::initializers::mappedMemoryRange();
    memoryRange.memory = uniformBuffers.dynamic.memory;
    memoryRange.size = uniformBuffers.dynamic.size;
    vkFlushMappedMemoryRanges(device, 1, &memoryRange);
}
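As a side note, the dynamicAlignment used in the snippet above is usually derived from the device's minUniformBufferOffsetAlignment (queried from VkPhysicalDeviceLimits at startup) by rounding the per-object block size up to that alignment. A minimal sketch; the 256-byte alignment in the usage comment is an assumption, real devices report their own value:

```cpp
#include <cassert>
#include <cstddef>

// Round a per-object block size (e.g. sizeof(glm::mat4) == 64) up to the
// device's minUniformBufferOffsetAlignment. minAlign must be a power of two.
std::size_t alignedSize(std::size_t blockSize, std::size_t minAlign)
{
    return (blockSize + minAlign - 1) & ~(minAlign - 1);
}

// e.g. alignedSize(sizeof(glm::mat4), 256) == 256 on a device
// whose minUniformBufferOffsetAlignment is 256
```

Each object's slice then starts at index * alignedSize(...), which is also the stride you use for the dynamic offsets passed to vkCmdBindDescriptorSets.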

What can I use instead of memcpy(uniformBuffers.dynamic.mapped, uboDataDynamic.model, uniformBuffers.dynamic.size); to update only some of the matrices rather than the entire buffer?
2) Can I use dynamic uniform buffers together with Vulkan indirect draw?
Thanks.


how can I copy only some of the matrices to the GPU instead of copying the entire buffer?

 

Of course you can change only some matrices in the mapped memory, but a single upload still has to cover the whole range between them (or you can use multiple uploads to skip large memory regions where no changes occur).
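For a single changed matrix this boils down to copying only that object's aligned slice and flushing only that range; Vulkan requires the flushed offset and size to be multiples of nonCoherentAtomSize unless the memory is host-coherent. A sketch with hypothetical names (updateOne, FlushRange); the alignment and atom-size values are assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy one object's data into its aligned slice of the mapped dynamic
// uniform buffer, and compute the flush range widened to the device's
// nonCoherentAtomSize as Vulkan requires for vkFlushMappedMemoryRanges.
struct FlushRange { std::size_t offset, size; };

FlushRange updateOne(std::uint8_t* mapped, const void* src, std::size_t srcSize,
                     std::size_t index, std::size_t dynamicAlignment,
                     std::size_t atomSize)
{
    std::size_t byteOffset = index * dynamicAlignment;   // slice of object 'index'
    std::memcpy(mapped + byteOffset, src, srcSize);      // partial update only

    FlushRange r;
    r.offset = byteOffset & ~(atomSize - 1);             // round down to atom size
    r.size   = ((byteOffset + srcSize + atomSize - 1) & ~(atomSize - 1)) - r.offset;
    return r;  // plug r.offset / r.size into a VkMappedMemoryRange
}
```

This replaces the whole-buffer memcpy with one memcpy and one small flush per changed object; for many scattered changes, batching the ranges into one vkFlushMappedMemoryRanges call is cheaper.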

 

An alternative would be to upload only the changed matrices, plus an array identifying their target indices, in a small separate buffer, and use a compute shader to copy the matrices to their correct locations on the GPU.

 

 

I don't know much about UBOs myself, so I share your questions, but I did this related experiment:

 

Render 2 million cubes, each with its own transformation matrix (FPS around 60, IIRC).

Each vertex had an integer index to its matrix in the w component of its position; the matrices were simply stored in a single large buffer. (There was no performance difference between using a Uniform or Storage Buffer for this.)

This way I can draw all cubes with one draw call, similar to a skinned mesh. I have not tested whether this is faster than using 2 million draws and identifying the proper matrix per object.

I assume it depends on the vertex count per object, and I expect these differences:

Per draw: the matrix can be copied to some fast constant read-only memory before issuing the draw.

Per vertex: each vertex needs to fetch its own matrix, probably from cache.

But I do not really know. Maybe some graphics guru can share some insights about this... :)
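A minimal CPU-side sketch of the layout described above, with the matrix index packed into position.w (all names are mine, not from the experiment). A float holds integers exactly up to 2^24, which covers the 2 million objects:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Each vertex stores its object's matrix index in position.w; the vertex
// shader would recover it as int(inPos.w) and index the big matrix buffer.
struct Vertex { float x, y, z, w; };

// Build the 8 corner vertices of a unit cube tagged with matrixIndex.
std::vector<Vertex> buildCube(std::uint32_t matrixIndex)
{
    std::vector<Vertex> v;
    for (int i = 0; i < 8; ++i)
    {
        float x = (i & 1) ? 1.0f : -1.0f;
        float y = (i & 2) ? 1.0f : -1.0f;
        float z = (i & 4) ? 1.0f : -1.0f;
        v.push_back({x, y, z, static_cast<float>(matrixIndex)});
    }
    return v;
}
```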


use a compute shader to copy the matrices to their correct location on GPU.

In practice, beyond the tests I'm doing now, the real application needs picking that changes the color of the selected object, a visible/invisible flag, and a color that may change from time to time during indirect draw.
Picking is the most important part: can you advise me on how to do it with a compute shader? Are there architectural or simple examples for picking and bounding boxes?
Thanks


So you have a number of unordered objects on screen and the user picks one with a mouse cursor or crosshair?

 

The fastest way would be raytracing (e.g. for a crosshair the ray is simply the front vector of the camera transform).

If you have a very large number of objects you might want to use a hierarchy, an octree or whatever to speed things up, but for a single ray this should not be necessary.

A typical ray vs. axis-aligned bounding box test looks like this:

 

    bool TestRayAABox (const Ray& ray, const AABox& box)
    {
        // returns false if ray is parallel with box face
        qVec3 t0 = qVec3(box.minmax[0] - ray.origin).MulPerElem (ray.invDirection);
        qVec3 t1 = qVec3(box.minmax[1] - ray.origin).MulPerElem (ray.invDirection);
        qVec3 tMin = t0.MinPerElem (t1);
        qVec3 tMax = t0.MaxPerElem (t1);
        qScalar ffd = tMin.MaxElem(); // front face distance (behind origin if inside box)
        qScalar bfd = tMax.MinElem(); // back face distance    

        return (ffd <= bfd) & (bfd >= 0.0f) & (ffd <= ray.length);
    }

    bool IntersectRayAABox (const Ray& ray, const AABox& box, float& t)
    {
        qScalar ffd, bfd;
        // computes front and back face distances, as in TestRayAABox above
        DistanceRayAABoxFrontAndBackface (ffd, bfd, ray, box);
        //RenderPoint (3, qVec3(ray.origin + ray.direction * ffd), 1,0,0);
        //RenderPoint (3, qVec3(ray.origin + ray.direction * bfd), 0,1,0);

        // always returns the first intersection with a face; the point of
        // intersection is ray.origin + ray.direction * t
        t = (ffd > 0) ? ffd : bfd;
        return (ffd <= bfd) & (bfd >= 0.0f) & (ffd <= ray.length);
    }
 

 

box.minmax[0] is the minimum 3D box position and [1] is the maximum.

ray.invDirection is 1.0 / the ray's unit direction.

This is a CPU SIMD implementation; all those member functions of qVec3 should be self-explanatory and are also available for GLSL vectors.

You would just iterate over all boxes and keep the closest hit.
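That loop could look like the sketch below, a plain scalar C++ version of the slab test above (my own names; Ray carries the inverse direction and a max length like the qVec3 version):

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <cstddef>
#include <vector>

struct Vec3  { float x, y, z; };
struct Ray   { Vec3 origin, invDir; float length; };  // invDir = 1 / direction
struct AABox { Vec3 min, max; };

// Scalar slab test: ffd/bfd are the front/back face distances along the ray.
bool intersect(const Ray& ray, const AABox& box, float& t)
{
    float t0x = (box.min.x - ray.origin.x) * ray.invDir.x;
    float t1x = (box.max.x - ray.origin.x) * ray.invDir.x;
    float t0y = (box.min.y - ray.origin.y) * ray.invDir.y;
    float t1y = (box.max.y - ray.origin.y) * ray.invDir.y;
    float t0z = (box.min.z - ray.origin.z) * ray.invDir.z;
    float t1z = (box.max.z - ray.origin.z) * ray.invDir.z;
    float ffd = std::max({std::min(t0x, t1x), std::min(t0y, t1y), std::min(t0z, t1z)});
    float bfd = std::min({std::max(t0x, t1x), std::max(t0y, t1y), std::max(t0z, t1z)});
    t = (ffd > 0.0f) ? ffd : bfd;
    return (ffd <= bfd) && (bfd >= 0.0f) && (ffd <= ray.length);
}

// Brute-force pick: test every box, keep the closest hit; -1 means no hit.
int pickClosest(const Ray& ray, const std::vector<AABox>& boxes)
{
    int best = -1;
    float bestT = FLT_MAX;
    for (std::size_t i = 0; i < boxes.size(); ++i)
    {
        float t;
        if (intersect(ray, boxes[i], t) && t < bestT) { bestT = t; best = (int)i; }
    }
    return best;
}
```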

 

You can do this with compute shaders, but for only a single ray doing it on CPU would make more sense even with 1000 boxes. GPU is good for thousands of rays, or a huge number of boxes (100000?) in parallel.

 

If you are looking for an introduction to compute shaders, I recommend the chapter from the OpenGL SuperBible.

A few pages there are enough to explain not only the tech but, more importantly, the real catch and power of parallel programming. (But be warned: they have some barriers in the wrong order. memoryBarrierShared() followed by barrier() is correct.)

Vulkan changed only the things necessary on the CPU side; the shaders themselves stay the same. Sascha Willems' code on GitHub has Vulkan examples for this.

Thank you, you are very kind.
Just 2 questions before I start reading:
1) I have 4 lists of geometric objects with 100000 objects in the worst case, and creating a scene graph would be a huge waste of memory. Could you point me to an example or a tutorial for doing these things without a scene graph, if possible?
2) I read something about compute shaders; in some cases I'd be fine if I could send a raw 64-bit address per object to find the corresponding C++ list object very quickly, but in the end this is just an index, so can I use an ordered list instead?
Thanks, now I'll start reading the SuperBible.
Hello.

use a compute shader to copy the matrices to their correct location on GPU.

What does this mean? I don't understand. Edited by giugio1977


1) I have 4 lists of geometric objects with 100000 objects in the worst case, and creating a scene graph would be a huge waste of memory; could you point me to an example or a tutorial for doing these things without a scene graph, if possible?

 

I assume you want to check the bounding boxes of all 100000 objects against a single picking ray.

You can brute force this on the GPU. Say the GPU has 2000 threads; then each one would only do 50 tests.

So this works well, but only if you already have (and animate) the box data on the GPU. If you need to upload the whole box data from system memory each frame, the upload would take longer than the tests.

I guess it's still doable depending on your needs, but if only a few objects move per frame you should indeed upload only the changes:

 

use a compute shader to copy the matrices to their correct location on GPU.

 

E.g. you have 5 objects on the GPU, written here as array index plus their data as a letter:

(0:a),(1:b),(2:c),(3:d),(4:e)

Now you decide on the CPU that objects 0 and 3 have to change their letter.

You would upload this compact data:

(0:x),(3:y)

Then you run a compute shader that copies the uploaded data to the original locations, effectively updating the data, so you have:

(0:x),(1:b),(2:c),(3:y),(4:e)

Easy, but that's what I mean: it limits the cost of the expensive upload operation.
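A tiny CPU simulation of that scatter copy, with hypothetical names; in GLSL the loop body would be a single compute invocation indexed by gl_GlobalInvocationID.x:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One (index, value) pair per changed object, uploaded compactly.
struct Update { std::uint32_t index; char value; };

// Scatter the compact uploads to their target slots in the big buffer,
// as the compute shader would do on the GPU.
void scatterCopy(std::vector<char>& gpuData, const std::vector<Update>& uploads)
{
    for (const Update& u : uploads)   // one compute invocation per pair
        gpuData[u.index] = u.value;
}
```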

 

 

2) I read something about compute shaders; in some cases I'd be fine if I could send a raw 64-bit address per object to find the corresponding C++ list object very quickly, but in the end this is just an index, so can I use an ordered list instead?

 

No idea what you mean here - language problems :)

But one thing: it's unlikely you'll keep using the same memory layout for your data on the GPU.

 

E.g. you have on the CPU:

 

    struct object
    {
        vec3 boxMin, boxMax;
        mat4x4 transform;
    } objects[100000];

 

You likely end up on the GPU with something like:

    vec3 boxMin[100000];
    vec3 boxMax[100000];
    vec3 pos[100000];
    vec4 quat[100000]; // choosing a quaternion instead of a matrix can save bandwidth and GPU registers

 

... so more the SOA than the AOS way.

This is because it's likely that all GPU threads read boxMin at the same time, so having those values close together is better, even if the next instruction reads boxMax.
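As an illustration, de-interleaving the CPU-side structs into that SOA layout might look like this sketch (all names are assumptions, not from the post):

```cpp
#include <cassert>
#include <vector>

struct Vec3   { float x, y, z; };
struct Object { Vec3 boxMin, boxMax; /* transform omitted for brevity */ };

// Struct-of-arrays mirror of the GPU-side layout: all boxMin values
// contiguous, so threads reading boxMin together hit adjacent memory.
struct SoA { std::vector<Vec3> boxMin, boxMax; };

SoA toSoA(const std::vector<Object>& objects)
{
    SoA soa;
    soa.boxMin.reserve(objects.size());
    soa.boxMax.reserve(objects.size());
    for (const Object& o : objects)
    {
        soa.boxMin.push_back(o.boxMin);
        soa.boxMax.push_back(o.boxMax);
    }
    return soa;
}
```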
