
Vulkan render huge amount of objects

This topic is 618 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

If you are on a desktop you can see an old Flash demo I did here to test what your GPU can handle.

This is in Flash, by the way, so a native implementation should be able to beat what you see here (it is not massively optimised either).

 

There is no instancing and each object has a unique transform, the only thing constant between draws is the material.

 

Lower-end GPUs should manage 500-1000 with no problems, mid-range 1500-3000, and high-end can hit 8,000+.

 

http://blog.bwhiting.co.uk/?p=314

 

My computer runs your demo at 30fps with 1000+ total render objects. Can you show your code for how you update the transforms 1000+ times?


Not sure if I understood you correctly, but if you're using 10 CBuffers to render 10 boxes with some (forward) lighting, you could do with 2 constant buffers (not 10):

 

1. a CB per frame, containing possibly the viewProjection matrix and your light properties (for multiple lights)

2. a CB per object, which you update for each object, after the previous one is drawn

 

Both having a corresponding C++ struct in your code.

If you're using 10 different CBuffers, that might explain a part of the unexpected performance.
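A sketch of that two-buffer split as C++ structs mirroring the HLSL cbuffers (all struct names and members here are illustrative assumptions, not from the thread):

```cpp
#include <cassert>

// Hypothetical 4x4 matrix and float4 types; any math library's equivalents work.
struct Matrix4x4 { float m[16]; };
struct Float4    { float x, y, z, w; };

// Updated once per frame: camera and lighting data shared by every draw.
struct PerFrameCB
{
    Matrix4x4 viewProjection;
    Float4    lightDirection;  // xyz = direction, w unused
    Float4    lightColor;
};

// Updated once per object, just before its draw call.
struct PerObjectCB
{
    Matrix4x4 world;
};

// D3D11 requires constant buffer sizes to be multiples of 16 bytes.
static_assert(sizeof(PerFrameCB)  % 16 == 0, "per-frame CB must be 16-byte aligned");
static_assert(sizeof(PerObjectCB) % 16 == 0, "per-object CB must be 16-byte aligned");
```

The per-frame buffer is bound once at the start of the frame; only the small per-object buffer is rewritten between draws.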

I only use one cbuffer to hold the transform matrix. The 10 cbuffers are all different: 5 for the vertex shader, 5 for the pixel shader.

The 5 cbuffers are transform, light, camera, skin, and frame.

Since I use one cbuffer for the transformation matrices, I guess it could cause something called CPU-GPU contention.

Edited by poigwym


I do it slightly differently than some.

Each 3d object has a transform. This has getters/setters for scale/position/rotation.

There is a 32-bit dirty flag that is updated through the setters. So at any given time you can know if the scale, rotation or position has been changed, and in detail too, i.e. which component.

When a matrix is required, the transform is asked to build it; if the dirty flag is non-zero the matrix needs rebuilding. Depending on which flags are set it will do it differently. Scales and translations are very fast: just directly set the values. But if rotations are included then a more complex recompose is done (sin/cos etc.). You can do this a number of ways; look online for various approaches.

A common one is to build each rotation matrix required and combine it.

Then if the transform has a parent it needs to be concatenated with its parent's transform too; managing these relationship updates can be tricky and I am still not sold on the best way to do it.

 

You don't have to do it this way of course; you can just operate directly on a matrix, appending transformations to it as you wish (which would probably be faster).

 

 

The view/projection matrix is calculated once per frame and shared across each draw call, only the world matrix is updated in the buffer between calls, so that is just copying 16 floats into the buffer and nothing else - should be pretty quick.
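A minimal sketch of that dirty-flag scheme (the names, the single rotation axis, and the row-major layout are my assumptions, not from the post):

```cpp
#include <cassert>
#include <cmath>

// Per-component dirty bits, set by the setters below.
enum DirtyFlags : unsigned
{
    DIRTY_NONE     = 0,
    DIRTY_SCALE    = 1 << 0,
    DIRTY_ROTATION = 1 << 1,
    DIRTY_POSITION = 1 << 2,
};

struct Matrix4x4
{
    float m[16] = { 1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1 };  // identity
};

class Transform
{
public:
    void SetPosition(float x, float y, float z) { px = x; py = y; pz = z; dirty |= DIRTY_POSITION; }
    void SetScale   (float x, float y, float z) { sx = x; sy = y; sz = z; dirty |= DIRTY_SCALE; }
    void SetRotationZ(float radians)            { rz = radians;           dirty |= DIRTY_ROTATION; }

    unsigned DirtyBits() const { return dirty; }

    // Rebuild only when a setter has touched something since the last call.
    const Matrix4x4& GetMatrix()
    {
        if (dirty != DIRTY_NONE)
        {
            if (rz != 0.0f)
            {
                // Full recompose: scale combined with rotation (sin/cos path).
                const float c = std::cos(rz), s = std::sin(rz);
                matrix = Matrix4x4{};
                matrix.m[0] =  c * sx;  matrix.m[1] = s * sx;
                matrix.m[4] = -s * sy;  matrix.m[5] = c * sy;
                matrix.m[10] = sz;
            }
            else
            {
                // No rotation: just write the scale values directly.
                matrix.m[0] = sx;  matrix.m[5] = sy;  matrix.m[10] = sz;
            }
            matrix.m[12] = px;  matrix.m[13] = py;  matrix.m[14] = pz;
            dirty = DIRTY_NONE;
        }
        return matrix;
    }

private:
    float px = 0, py = 0, pz = 0;
    float sx = 1, sy = 1, sz = 1;
    float rz = 0;
    unsigned dirty = DIRTY_NONE;
    Matrix4x4 matrix;
};
```

Scale/position changes hit the cheap branch; only a rotation change pays for the trig recompose.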

 

Hope that helps.

Share this post


Link to post
Share on other sites


Wow, you reply so quickly!!! It's the first time I've felt like there's a chat online rather than waiting many hours.


Then if the transform has a parent it needs to be concatenated with it's transform too, managing these relationship updates can be tricky and I am still not sold on the best way to do it.

 

 

Are you afraid the parent is not dirty but the child is, so you would skip updating the child's transform?

I solve it by using three flags called "childNeedUpdate", "selfDirty", and "parentHasUpdate". When a child changes its transform, it sets childNeedUpdate on all of its ancestors. When updating the tree, a node that only has the childNeedUpdate flag does not need to recompute its transform matrix; it just acts as a bridge and calls its children's update. A node recomputes its transform matrix only when parentHasUpdate or selfDirty is true; in those two cases

it must also remember to set parentHasUpdate on all of its children before calling their update.

 

I think it is fast, but I haven't tested it with a huge number of objects.
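A sketch of that three-flag scheme (the flag names follow the post; the Node layout and the recomputeCount stand-in for the matrix rebuild are my assumptions):

```cpp
#include <cassert>
#include <vector>

struct Node
{
    Node* parent = nullptr;
    std::vector<Node*> children;

    bool selfDirty = false;
    bool parentHasUpdate = false;
    bool childNeedUpdate = false;

    int recomputeCount = 0;  // stands in for the world-matrix rebuild

    // Called whenever this node's local transform changes.
    void MarkDirty()
    {
        selfDirty = true;
        for (Node* a = parent; a; a = a->parent)
            a->childNeedUpdate = true;   // let ancestors route the update down
    }

    void Update()
    {
        if (selfDirty || parentHasUpdate)
        {
            ++recomputeCount;             // recompute world matrix here
            for (Node* c : children)
                c->parentHasUpdate = true;  // children must recombine with us
        }
        if (selfDirty || parentHasUpdate || childNeedUpdate)
        {
            for (Node* c : children)
                c->Update();              // bridge down into dirty subtrees only
        }
        selfDirty = parentHasUpdate = childNeedUpdate = false;
    }
};
```

Clean subtrees are never visited; a node with only childNeedUpdate set recurses without recomputing its own matrix.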

Edited by poigwym

The method I've used in the past is to put all the hierarchical transforms into a big array, and make sure it's sorted by hierarchical depth -- i.e. parents always appear in the array before their children. Something like:
struct Node
{
  Matrix4x4 localToParent;//modified when moving the node around
  Matrix4x4 localToWorld;//computed per frame - this node's world matrix
  int parent;//-1 for root nodes
};

std::vector<Node> nodes;

//Then the update loop is super simple
for(int i = 0, end = (int)nodes.size(); i != end; ++i )
{
  Node& n = nodes[i];//non-const: we write localToWorld below
  assert( i > n.parent );//assert parents are updated before their children
  if( n.parent == -1 )//root node
    n.localToWorld = n.localToParent;
  else//child node
    n.localToWorld = n.localToParent * nodes[n.parent].localToWorld;  // local to parent * parent to world == local to world
}


Aaaaa, I clicked on something and lost my essay of a message. I should really install a form-saver plugin!!!

 

@hodgman, interesting and tidy approach, but does it end up being more efficient than a normal tree traversal? I guess it depends how much changes from frame to frame; if nothing does, then a full tree traversal just for transform updating is pointless. But sorting arrays sounds slow too.

 

I was aiming for a solution that only touches the minimal set of nodes in response to a change but also scales well from zero changes to changes in every object in the scene. Me wants cake and eating it!

 

@poigwym, flags like that should work well I think.


@hodgman thinking about this further, are you suggesting that you only add nodes into that array in reaction to something changing? In fact, scratch that: you would also have to add any child nodes in that case, and it wouldn't work if a transform was changed multiple times.

 

There will always need to be a complete hierarchy pass then, I think; I can't see how to avoid it. In which case it still makes sense to just update all transforms.

 

Some odd cases to think about.

  • Leaf node modified followed by parent followed by its parent... all the way up to the root in that order. 
  • Root node modified (all children will need updating)
  • Leaf node modified followed by its parent's parent alternating all the way to the root.

There are two things at play with transforms the way I see it: the local update of a matrix when it is changed, then the re-combining of all the child matrices. The latter is where I am struggling to see the optimal solution.

 

Rebuild from scratch? Update and recombine using a 3rd snapshot matrix that represents the hierarchy above? Some other genius idea of justice?

 

EDIT::

If I get time I might make a 2D test bed for this: a simple visual 2D tree that is updateable via mouse drags. I can then try various approaches and, rather than benchmark, compare how much work is done or saved.

Edited by bwhiting


Storing trees in linear arrays is always good, especially if most of the tree structure remains static (e.g. a character).

That does not mean you have to process the whole tree if there are only a few changes.

The advantage is cache-friendly linear memory access. You get this for partial updates too, if you use a nice memory order (typically sorted by tree level as Hodgman said, with all children of any node stored in gapless order).

 

However, 100 is a small number and I can't imagine tree traversal or transformation causing such low fps.

Do you upload each transform individually to the GPU? It seems you map/unmap buffers for each object; that's slow and is probably the reason.

Use a single buffer containing all transforms instead, so you have only one upload per frame.

Also make sure all render data (transforms and vertices) are in GPU memory and not in host memory.
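A sketch of the single-upload idea (the function name and persistently mapped pointer are assumptions; in Vulkan the pointer would come from vkMapMemory on host-visible memory, kept mapped for the buffer's lifetime):

```cpp
#include <cstring>
#include <vector>

struct Matrix4x4 { float m[16]; };

// Gather every object's world matrix into one contiguous CPU array, then copy
// it into the mapped GPU buffer with a single memcpy, instead of a map/unmap
// round trip per object.
void UploadTransforms(const std::vector<Matrix4x4>& worldMatrices,
                      void* mappedBufferPtr)  // persistently mapped GPU memory
{
    std::memcpy(mappedBufferPtr,
                worldMatrices.data(),
                worldMatrices.size() * sizeof(Matrix4x4));
    // Each draw then reads its own matrix via a dynamic uniform-buffer offset
    // or an index into the array inside the shader.
}
```

One big copy per frame amortises the driver overhead across all objects.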


One thing I notice is that you do world * view * proj in setTransform every time. Can't you just calculate view * proj once per frame, and pass that in to be multiplied with the world transform? That would cut the matrix multiplications by 66%, if I'm understanding this correctly.
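A sketch of that hoisting (the matrix type and row-major multiply helper are illustrative assumptions, not the poster's code):

```cpp
#include <vector>

struct Matrix4x4 { float m[16] = {}; };

// Row-major 4x4 multiply: out = a * b (illustrative helper).
Matrix4x4 Mul(const Matrix4x4& a, const Matrix4x4& b)
{
    Matrix4x4 out;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            for (int k = 0; k < 4; ++k)
                out.m[r * 4 + c] += a.m[r * 4 + k] * b.m[k * 4 + c];
    return out;
}

// Compute view * proj once per frame; each object then needs only one
// world * viewProj multiply instead of two multiplies per object.
void BuildWVPs(const std::vector<Matrix4x4>& worlds,
               const Matrix4x4& view, const Matrix4x4& proj,
               std::vector<Matrix4x4>& outWVP)
{
    const Matrix4x4 viewProj = Mul(view, proj);  // hoisted out of the loop
    outWVP.clear();
    for (const Matrix4x4& world : worlds)
        outWVP.push_back(Mul(world, viewProj));
}
```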

