Vulkan: rendering a huge number of objects


Hello!

With a modern GPU and a modern graphics API (DX12, Vulkan), how many objects can be drawn at most at 60 fps? And with one light?

My scene with 100 boxes and a directional light runs at 15 fps. I'm not sure if that is normal?

I had a look at the Horde3D engine; it seems to draw 100 crowded animated models without using instancing, but it still runs smoothly. I guess it may be faster than 60 fps. How does it do that?

I need tutorials/links about rendering big scenes.

Same here, show some details and we'll take a peek. Assuming the boxes are made up of 8 vertices and 12 triangles, I agree that it's not OK.

It also helps to know which hardware/GPU you're running it on (just to be sure).


I have a profiler in my engine, and found that the time spent on matrix multiplication and cbuffer commits is the largest of all operations.

 

After I turned off all lights, the process is simple: for every object, update the transform cbuffer, commit the cbuffer, and draw.

 

In my engine, all shaders share 10 cbuffers (5 for the VS, 5 for the PS).

One cbuffer looks like this:
struct CBTransform /*: register(b0)*/
{
  Matrix4f world_matrix;
  Matrix4f world_invTrans_matrix;
  Matrix4f world_view_proj_matrix;
  Matrix4f light_view_proj_matrix;
};
 
 
void setTransform(Renderable node)
{
    // Update the per-object transform cbuffer.
    CBTransform *p = reinterpret_cast<CBTransform*>(
        renderbase->MapResource(_cbtrans, 0, D3D11_MAP_WRITE_DISCARD, 0));

    Matrix4f world;
    if (node->_hasBone)
        world.initIdentity();          // skinned: the bone matrices carry the transform
    else
        world = node->getTransform();

    p->world_matrix = world;
    p->world_invTrans_matrix = world;  // NOTE: presumably this should be the inverse-transpose of world

    Camera *camera = Engine::sceneManager().getMainCamera();
    Matrix4f view = camera->getView();
    Matrix4f proj = camera->getProjection();
    p->world_view_proj_matrix = world * view * proj;

    if (_curlight) {
        // world * (light view-projection)
        p->light_view_proj_matrix = world * _curlight->getLightTransform();
    }

    renderbase->unMapResource(_cbtrans, 0);
}
 
// Since all shaders share 10 cbuffers, I bind all 10 to the GPU on every draw call. I'm not sure if this method is right??
_context->VSSetConstantBuffers(0, (int)CBufType::MAX_CBUF_GROUP, _cBufs[(int)ShaderType::VERTEX_SHADER]);
_context->PSSetConstantBuffers(0, (int)CBufType::MAX_CBUF_GROUP, _cBufs[(int)ShaderType::PIXEL_SHADER]);
 


Hehe, I forgot to say that the 100 boxes I draw all have the same look and use the same vertex buffer, but I don't use the instancing technique.

I need to update and commit the cbuffer 100 times per frame.

Is it possible to draw 100 dynamic boxes, each with a different vertex buffer and texture, without instancing, at 60 fps on a modern GPU? My CPU and GPU are a little old.


If you are on a desktop you can try an old Flash demo I did here to test what your GPU can handle.

This is Flash, by the way, so a native application should be able to beat what you see here (it is not massively optimised either).

 

There is no instancing and each object has a unique transform, the only thing constant between draws is the material.

 

Lower-end GPUs should manage 500-1000 with no problems, mid-range 1500-3000, high-end can hit 8,000+.

 

http://blog.bwhiting.co.uk/?p=314 


Not sure if I understood you correctly, but if you're using 10 CBuffers to render your boxes with some (forward) lighting, you could do with 2 constant buffers (not 10):

 

1. a CB per frame, containing the viewProjection matrix and your light properties (for multiple lights)

2. a CB per object, which you update for each object, after the previous one is drawn

 

Both having a corresponding C++ struct in your code.

If you're using 10 different CBuffers, that might explain a part of the unexpected performance.
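
Roughly, the split could look something like this (just a sketch, assuming D3D11 and your own Matrix4f/Vector4f types, buffers created with D3D11_USAGE_DYNAMIC; obj->draw() is a stand-in for whatever ends in your DrawIndexed call):

#include <d3d11.h>
#include <cstring>
#include <vector>

// C++ mirrors of the two cbuffers (mind the 16-byte packing rules):
struct CBPerFrame
{
    Matrix4f viewProj;        // view * projection, computed once per frame
    Vector4f lightDirection;  // example light properties
    Vector4f lightColor;
};

struct CBPerObject
{
    Matrix4f world;           // the only thing that changes per draw
};

void drawScene(ID3D11DeviceContext* context, ID3D11Buffer* cbPerFrame, ID3D11Buffer* cbPerObject,
               const CBPerFrame& frameData, const std::vector<Renderable>& objects)
{
    ID3D11Buffer* cbs[2] = { cbPerFrame, cbPerObject };
    context->VSSetConstantBuffers(0, 2, cbs);          // b0 = per frame, b1 = per object

    // Once per frame:
    D3D11_MAPPED_SUBRESOURCE mapped;
    context->Map(cbPerFrame, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
    memcpy(mapped.pData, &frameData, sizeof(CBPerFrame));
    context->Unmap(cbPerFrame, 0);

    for (Renderable obj : objects)
    {
        // Once per object, just before its draw:
        CBPerObject objData = { obj->getTransform() };
        context->Map(cbPerObject, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
        memcpy(mapped.pData, &objData, sizeof(CBPerObject));
        context->Unmap(cbPerObject, 0);
        obj->draw(context);                             // stand-in for your DrawIndexed path
    }
}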



My computer runs your demo at 30 fps with 1000+ total render objects. Can you show your code for how you update the transforms 1000+ times?



I only use one cbuffer to hold the transform matrices; the 10 cbuffers are all different, 5 for the vertex shader and 5 for the pixel shader.

The 5 VS cbuffers are transform, light, camera, skin, and frame.

Since I use one cbuffer for the transformation matrices, I guess it could cause something called CPU-GPU contention.


I do it slightly differently than some.

Each 3d object has a transform. This has getters/setters for scale/position/rotation.

There is a 32-bit dirty flag that is updated through the setters, so at any given time you can know whether the scale, rotation or position has been changed, and in detail too, i.e. which component.

When a matrix is required, the transform is asked to build it; if the dirty flag is non-zero the matrix needs rebuilding. Depending on which flags are set it will do it differently. Scales and translations are very fast, you just directly set the values, but if rotations are included then a more complex recompose is done (sin/cos etc.). You can do this a number of ways; look online for various approaches.

A common one is to build each rotation matrix required and combine them.

Then if the transform has a parent it needs to be concatenated with its parent's transform too; managing these relationship updates can be tricky and I am still not sold on the best way to do it.

 

You don't have to do it this way of course; you can just operate directly on a matrix, appending transformations to it as you wish (that would probably be faster).
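
A rough sketch of the dirty-flag idea (Matrix4f/Vector3f and their helpers here are just placeholders, not my actual code or a real library):

class Transform
{
public:
    enum DirtyBits { DIRTY_POSITION = 1, DIRTY_ROTATION = 2, DIRTY_SCALE = 4 };

    void setPosition(const Vector3f& p) { _position = p; _dirty |= DIRTY_POSITION; }
    void setRotation(const Vector3f& r) { _rotation = r; _dirty |= DIRTY_ROTATION; }
    void setScale   (const Vector3f& s) { _scale    = s; _dirty |= DIRTY_SCALE; }

    const Matrix4f& getMatrix()
    {
        if (_dirty != 0)   // only rebuild when a setter was actually called
        {
            if (_dirty & DIRTY_ROTATION)
            {
                // expensive path: full recompose with sin/cos
                _matrix = Matrix4f::scale(_scale)
                        * Matrix4f::rotationXYZ(_rotation)
                        * Matrix4f::translation(_position);
            }
            else
            {
                // cheap path: rotation untouched (assumed identity here), just write the values straight in
                _matrix.setScale(_scale);
                _matrix.setTranslation(_position);
            }
            _dirty = 0;
        }
        return _matrix;     // concatenate with the parent's matrix here if there is one
    }

private:
    Vector3f _position, _rotation;
    Vector3f _scale = Vector3f(1, 1, 1);
    Matrix4f _matrix;       // assumed to default to identity
    unsigned _dirty = ~0u;  // everything dirty until the first build
};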

 

 

The view/projection matrix is calculated once per frame and shared across each draw call, only the world matrix is updated in the buffer between calls, so that is just copying 16 floats into the buffer and nothing else - should be pretty quick.

 

Hope that helps.



Wow, you reply so quickly!!! It's the first time I've felt like there's a chat online rather than waiting many hours.


Then if the transform has a parent it needs to be concatenated with its parent's transform too; managing these relationship updates can be tricky and I am still not sold on the best way to do it.

 

 

Are you afraid the parent is not dirty but the child is? Then you would skip updating the child's transform?

I solve it with 3 flags: "childNeedUpdate", "selfDirty" and "parentHasUpdated". When a child changes its transform, it sets childNeedUpdate on all of its ancestors. When updating the tree, if a node only has the childNeedUpdate flag it doesn't need to recompute its transform matrix; it just acts as a bridge and calls its children's update. A node recomputes its transform matrix only when either parentHasUpdated or selfDirty is true; in those two cases it must also remember to set parentHasUpdated on all of its children before calling them to update.

 

I think it is fast, but I haven't tested it with a huge number of objects.
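
Roughly it looks like this (a sketch of the idea, not my real engine code; Matrix4f is a placeholder):

#include <vector>

struct SceneNode
{
    SceneNode* parent = nullptr;
    std::vector<SceneNode*> children;
    Matrix4f local;                  // local transform
    Matrix4f world;                  // cached world transform
    bool selfDirty        = false;   // my own local transform changed
    bool parentHasUpdated = false;   // my parent recomputed its world matrix
    bool childNeedUpdate  = false;   // something below me changed

    void setLocal(const Matrix4f& m)
    {
        local = m;
        selfDirty = true;
        for (SceneNode* a = parent; a != nullptr; a = a->parent)
            a->childNeedUpdate = true;       // tell every ancestor a descendant needs updating
    }

    void update()
    {
        if (selfDirty || parentHasUpdated)
        {
            world = parent ? local * parent->world : local;   // recompute
            for (SceneNode* c : children)
                c->parentHasUpdated = true;                   // children must follow
            selfDirty = parentHasUpdated = false;
            childNeedUpdate = true;                           // make sure we still recurse below
        }
        if (childNeedUpdate)
        {
            for (SceneNode* c : children)    // "bridge" case: nothing to recompute here,
                c->update();                 // just pass the update down
            childNeedUpdate = false;
        }
    }
};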

The method I've used in the past is to put all the hierarchical transforms into a big array, and make sure it's sorted by hierarchical depth -- i.e. parents always appear in the array before their children. Something like:
struct Node
{
  Matrix4x4 localToParent;//modified when moving the node around
  Matrix4x4 localToWorld;//computed per frame - this node's world matrix
  int parent;//-1 for root nodes
};

std::vector<Node> nodes;

//Then the update loop is super simple
for( int i = 0, end = (int)nodes.size(); i != end; ++i )
{
  Node& n = nodes[i];
  assert( i > n.parent );//assert parents are updated before their children
  if( n.parent == -1 )//root node
    n.localToWorld = n.localToParent;
  else//child node
    n.localToWorld = n.localToParent * nodes[n.parent].localToWorld;  // local to parent * parent to world == local to world
}


aaaaa I clicked on something and lost my essay of a message, I should really install a form saver plugin!!!

 

@hodgman, interesting and tidy approach but does it end up being more efficient than a normal tree traversal? I guess it depends how much changes from frame to frame; if nothing does then a full tree traversal for transform updating only is pointless. But sorting arrays sounds slow also.

 

I was aiming for a solution that only touches the minimal set of nodes in response to a change but also scales well from zero changes to changes in every object in the scene. Me wants cake and eating it!

 

@poigwym, flags like that should work well I think.


@hodgman thinking about this further, are you suggesting that you only add nodes into that array in reaction to something changing? In fact, scratch that: you would also have to add any child nodes too in that case, and it wouldn't work if a transform was changed multiple times.

 

There will always need to be a complete hierarchy pass then, I think; I can't see how to avoid it. In which case it still makes sense to just update all transforms.

 

Some odd cases to think about.

  • Leaf node modified followed by parent followed by its parent... all the way up to the root in that order. 
  • Root node modified (all children will need updating)
  • Leaf node modified followed by its parent's parent alternating all the way to the root.

There are 2 things at play with transforms the way I see it, the local update of a matrix when it is changed... then the re-combining of all the child matrices - this is where I am struggling to see the optimal solution.

 

Rebuild from scratch? Update and recombine using a 3rd snapshot matrix that represents the hierarchy above? Some other genius idea of justice?

 

EDIT::

If I get time I might make a 2d test bed to test this, a simple visual 2d tree that is update-able via mouse drags. I can then try various approaches and rather than benchmark I can compare how much work is done/or saved.


Storing trees in linear arrays is always good if most of the tree structure remains static (e.g. a character).

That does not mean you have to process the whole tree if there are only a few changes.

The advantage is cache-friendly linear memory access. You get this also for partial updates, if you use a nice memory order (typically sorted by tree level as hodgman said, with all children of any node in gapless order).

 

However, 100 is a small number and I can't imagine tree traversal or transformation causing such a low fps.

Do you upload each transform individually to the GPU? It seems you map/unmap buffers for each object - that's slow and is probably the reason.

Use a single buffer containing all transforms instead so you have only one upload per frame.

Also make sure all render data (transforms and vertices) is in GPU memory and not in host memory.
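
For example, something like this (a rough D3D11 sketch of the single-buffer idea - the names and the Matrix4f type are made up, error handling omitted):

#include <d3d11.h>
#include <cstring>
#include <vector>

// Create once at init time: one dynamic structured buffer holding every object's world matrix.
ID3D11Buffer* createTransformBuffer(ID3D11Device* device, UINT maxObjects)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth           = sizeof(Matrix4f) * maxObjects;
    desc.Usage               = D3D11_USAGE_DYNAMIC;
    desc.BindFlags           = D3D11_BIND_SHADER_RESOURCE;
    desc.CPUAccessFlags      = D3D11_CPU_ACCESS_WRITE;
    desc.MiscFlags           = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    desc.StructureByteStride = sizeof(Matrix4f);
    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, nullptr, &buffer);
    return buffer;   // also create an SRV over it and bind it once with VSSetShaderResources
}

// Per frame: one Map/Unmap for all transforms instead of one per object.
void uploadTransforms(ID3D11DeviceContext* context, ID3D11Buffer* buffer,
                      const std::vector<Matrix4f>& worldMatrices)
{
    D3D11_MAPPED_SUBRESOURCE mapped;
    context->Map(buffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
    memcpy(mapped.pData, worldMatrices.data(), sizeof(Matrix4f) * worldMatrices.size());
    context->Unmap(buffer, 0);
}

// Per draw: the vertex shader reads its matrix from a StructuredBuffer<float4x4>
// indexed by an object id, so nothing has to be mapped between draws.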


One thing I notice is that you do world * view * proj in setTransform every time. Can't you just calculate view * proj once per frame, and pass that in to be multiplied with the world transform? That would cut the matrix multiplications by 66%, if I'm understanding this correctly.
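
Roughly (reusing the names from your setTransform code above; just a sketch):

// Once per frame:
Matrix4f viewProj = camera->getView() * camera->getProjection();

// Per object, inside setTransform: one multiply instead of two.
p->world_view_proj_matrix = world * viewProj;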


I have made many optimizations: the cbuffers are shared by all shaders, so I can commit all cbuffers to the GPU at once with a single setConstantBuffers call, and I moved the view_proj_matrix into the per-frame cbuffer, so only the world_matrix is left to update per object.

But there is no apparent improvement. I finally found 3 places that lead to the low fps:

Map/Unmap 10-20%, DrawIndexed 30-50%, and the scene graph update 30-50% (I update the whole tree without any pruning).

I don't know why DrawIndexed costs nearly the most among these.

Do you know why all the other operations take nearly no time? (I have flags to record material, shader, render state... so the 100 boxes share the same material, shader, VB, IB and render state, which took me a few days to work out, so very little time is spent there.)

By the way, std::map::find and std::unordered_map::find are amazingly time-consuming too. The C++ FAQ says std::unordered_map::find should be faster than std::map::find (O(1) search even with string keys?? hehe). I use a std::string technique name to look up the technique in the material, and it seems 100 std::map::find or std::unordered_map::find calls per frame will slow the frame rate down!!!

 

 

An earlier optimization: my resource manager has a cache mechanism that updates the LRU queue every time I look up a resource by handle. The O(1) search by handle is fast, but updating the LRU raises the complexity to O(n). I had no choice but to turn off the LRU update, and gained more than 10 fps. But the scene with 100 boxes, one animated model and one light still runs at only 10+ fps.


 

Use a single buffer containing all transforms instead so you have only one upload per frame.

Also make sure all render data (transforms and vertices) is in GPU memory and not in host memory.

 

Yes, I upload each world matrix, and even the world-normal matrix, to the GPU for every object. haha.

But if I don't use one single buffer containing all transforms (isn't that called instancing??), how can I draw it faster than 60 fps... 100 fps??


I'm not sure, but I think 'instancing' is no more with Vulkan/DX12 - there is no more need for it?

 

I started with Vulkan some weeks ago and made a similar test, but forgot the exact results:

2 million (!) unlit textured cubes, each with its own transform and vertices: about 80 fps or more.

Next test: upload the matrices each frame using a single buffer: huge performance drop - can't remember the numbers (about 10-100x), but impossible to keep 60 fps - AGP bottleneck - nothing to do about it.

I've used a single draw call to render the boxes.

 

Using a modern API you can implement a whole graphics engine in one draw call, e.g. prebuild a command buffer that does the following steps:

  • Compute shader to multiply the worldspace matrices with the projection.
  • Compute shader to do frustum culling, writing the remaining draw count and indices to a buffer.
  • A single indirect draw call consuming the data from that buffer.

 

Per frame, you just need to upload a single buffer containing the (changed) worldspace matrices and execute the command buffer.

In real life it won't be that simple (streaming etc.), but I think that's the way to go.

If the GPU is clever and can process the command buffer on its own, there is almost no need for slow CPU/GPU interaction, and 'multithreaded command buffer creation' - the hyped killer feature of the new APIs - is not even necessary at all.
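
Very roughly, the prebuilt command buffer could be recorded like this (just a sketch - the pipelines, layouts and buffers are placeholders, render pass begin/end and vertex/index buffer binds are omitted):

#include <vulkan/vulkan.h>

// Placeholders - assume these were created elsewhere.
VkPipeline       cullPipeline;     // compute: transform + frustum cull
VkPipeline       drawPipeline;     // graphics pipeline for the boxes
VkPipelineLayout cullLayout;
VkDescriptorSet  cullDescriptors;  // matrices in, indirect commands out
VkBuffer         indirectBuffer;   // array of VkDrawIndexedIndirectCommand
uint32_t         maxDraws;

void recordFrame(VkCommandBuffer cmd, uint32_t objectCount)
{
    // 1. Compute pass: multiply the world matrices, frustum-cull, and write one
    //    VkDrawIndexedIndirectCommand per object (culled objects get instanceCount = 0).
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, cullPipeline);
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, cullLayout,
                            0, 1, &cullDescriptors, 0, nullptr);
    vkCmdDispatch(cmd, (objectCount + 63) / 64, 1, 1);

    // 2. Make the compute results visible to the indirect-draw stage.
    VkMemoryBarrier barrier = { VK_STRUCTURE_TYPE_MEMORY_BARRIER };
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
    vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT, 0,
                         1, &barrier, 0, nullptr, 0, nullptr);

    // 3. One indirect draw consumes whatever the culling pass left behind.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, drawPipeline);
    vkCmdDrawIndexedIndirect(cmd, indirectBuffer, 0, maxDraws,
                             sizeof(VkDrawIndexedIndirectCommand));
}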

 

But all this can't explain your low fps of 15 for 100 boxes - there must be something you are doing seriously wrong.

E.g. once I made a mistake displaying measured GPU time, showing 10ms when in reality it was 1ms :D - took days to figure out.

Maybe you should look for a simple hello world box example, modify it for more boxes and compare...


Lower end gpus should be 500-1000 no problems, mid range 1500-3000, high end can hit 8,000+

My computer runs your demo at 30 fps with 1000+ total render objects. Can you show your code for how you update the transforms 1000+ times?

What is your hardware? I got above 60k+ objects before the fps dropped to 30 :o
 

@hodgman, interesting and tidy approach but does it end up being more efficient than a normal tree traversal? I guess it depends how much changes from frame to frame; if nothing does then a full tree traversal for transform updating only is pointless. But sorting arrays sounds slow also.   I was aiming for a solution that only touches the minimal set of nodes in response to a change but also scales well from zero changes to changes in every object in the scene. Me wants cake and eating it!

In this case, I would guess that transferring two matrices from RAM and writing one back to RAM will take a lot more time than the clock cycles of computing a matrix multiplication, meaning the CPU will be idle for most of this process... Therefore, the bottleneck would be memory bandwidth, not processing power, and you should optimize for that. Keeping the data in a contiguous array and iterating through it in a predictable pattern allows the hardware pre-fetcher to automatically scan ahead and start downloading future elements of the array before they're required, reducing cache misses / observed memory latency. There's also random accesses to parent data, but these are typically small jumps backwards in the array, which are almost guaranteed to still be present in the L1 cache, so they incur no extra bandwidth cost.

This difference in memory access patterns could mean that using a linear array and processing every element takes the same amount of time as using a randomly allocated tree (e.g. every node is allocated individually with new) and only processing 10% of the elements! :o
 
Moreover, it's often important in games to optimize for the worst case behaviour of an algorithm, not the average case or best case.
e.g. if you're making an FPS with 60 players, each with 100 bones, that's a worst-case of 6000 character nodes per frame to recompute. If no one is on screen, it's a best case of 0 character nodes :)
It's all well and good to make sure that your framerate is amazing when there's no players on screen, but the most important thing is to make sure that the framerate is acceptable when every player pops up on your screen at once.
 
You can also add a level of hierarchy on top of this by allocating one array per character. In the above example, that would mean the character models would use 60 individual arrays of 100 nodes each, rather than one global array with 6000 nodes. You can then cull a model if they're off-screen, which saves you from processing 100 nodes in one go.
If different models can be attached to each other, then you just need to sort your models so that attachments update after their parents -- sorting an array of 60 objects is so cheap you almost won't be able to measure it. Within each sub-array, you'll never have to resort the data in most cases -- e.g. the ankle is always a child of the knee, and you're not going to change that at runtime. So this data gets sorted when creating the model file, and doesn't require re-sorting at runtime.
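
As a sketch (reusing the Node array from my earlier post; isOffScreen and updateNodeArray are stand-ins for whatever culling and node-update code you already have):

#include <algorithm>
#include <vector>

bool isOffScreen(const Model& m);                 // your culling test
void updateNodeArray(std::vector<Node>& nodes);   // the flat parents-before-children loop from earlier

struct Model
{
    std::vector<Node> nodes;   // per-character node array, parents before children
    int attachDepth = 0;       // 0 = stand-alone, 1 = attached to a depth-0 model, etc.
};

void updateModels(std::vector<Model*>& models)
{
    // ~60 entries: sorting this is so cheap you won't measure it,
    // and it only needs to happen when attachments change.
    std::sort(models.begin(), models.end(),
              [](const Model* a, const Model* b) { return a->attachDepth < b->attachDepth; });

    for (Model* m : models)
    {
        if (isOffScreen(*m))        // cull the whole model: skips all of its nodes at once
            continue;
        updateNodeArray(m->nodes);
    }
}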
 

There are 2 things at play with transforms the way I see it, the local update of a matrix when it is changed... then the re-combining of all the child matrices - this is where I am struggling to see the optimal solution.

Yep. The local matrices are the outputs of your animation, physics and specialized gameplay systems -- different game objects will get their local matrix data from different sources.
Once all the local matrix data has been computed, you can recompute all the world matrices by walking down your hierarchy and propagating parent to child.
You can try to skip work here by using dirty flags, etc - only updating nodes that have been modified (and nodes whose parent has been modified!)... or you can just recompute all of them :)


I've used a single draw call to render the boxes.

I have 100 draw calls - one context->DrawIndexed per object.

