
Vulkan: Confusing performance with async compute



I started to do some testing on this with compute-only workloads, so not the typical "do compute while rendering shadow maps" case.

The motivation is to keep the GPU busy in cases where a dispatch has little or zero work, which is often unavoidable.

 

I use Vulkan on a Fury X, but I'm interested in any experience / opinions, so I'm posting it here.

 

The test shader uses 64 threads. Each thread reads one float, then builds a prefix sum in LDS and writes it back.

On the API side I do 100 dispatches, but each with only one thread group (so one wavefront), with a memory barrier between each dispatch to simulate dependencies.

All of this is done 3 times on 3 different memory buffers, so each pass could run async.
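For reference, a minimal sketch of what the recording loop for one of those buffers might look like (the pipeline, layout and descriptor set handles are placeholders, and the exact stage / access masks are my assumption of what the dependency needs):

#include <vulkan/vulkan.h>

// Record 100 dependent single-workgroup dispatches into one command buffer.
void RecordTestDispatches(VkCommandBuffer cmd, VkPipeline pipeline,
                          VkPipelineLayout layout, VkDescriptorSet set)
{
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, layout,
                            0, 1, &set, 0, nullptr);

    VkMemoryBarrier barrier = {};
    barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;

    for (int i = 0; i < 100; ++i)
    {
        vkCmdDispatch(cmd, 1, 1, 1); // one workgroup = one wavefront of 64 threads

        // Make this dispatch's writes visible to the next one (the simulated dependency).
        vkCmdPipelineBarrier(cmd,
                             VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                             VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                             0, 1, &barrier, 0, nullptr, 0, nullptr);
    }
}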

 

 

Here are the results:

 

All in a single queue takes 1.41 ms

Using 3 queues I would expect one third of that, assuming my GPU can process 3 wavefronts in parallel :) What I really get is 0.64 ms, which is not that bad.

 

Now I remove the memory barriers. This gives wrong results, but shows how costly they are:

1 q.: 0.12 ms

3 q.: 0.16 ms ...now, starting to get bad

 

The next test is with barriers but zero-work dispatches, so doing nothing.

1 q.: 0.43 ms   wtf?

3 q.: 0.50 ms   I saw it coming

 

Next: zero-work dispatches without barriers, doing absolutely nothing.

1 q.: 0.12 ms

3 q.: 0.18 ms

 

Next, empty command buffers without any dispatch for reference

1 q.: 0 ms         what's that? driver optimizations?

3 q.: 0.14 ms    should we subtract this from all async timings? makes sense...

 

 

Some confusing numbers. (Ok - mostly in the impractical cases, but worth mentioning.)

 

Probably I should explain how I measure time:

Reading a GPU timestamp at start and end of each queue does not work here - they act like barriers and seem to disturb async compute.

So I read the time with Windows timeGetTime(), call vkQueueSubmit() for each queue, then vkQueueWaitIdle() for each queue, read the time again, accumulate over 100 frames and print the average.

So this includes API overhead.
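Roughly, the measurement looks like this (the queue and submit-info arrays are placeholders; timeGetTime() only returns whole milliseconds, hence the averaging over 100 frames):

#include <vulkan/vulkan.h>
#include <windows.h> // timeGetTime, link against winmm.lib
#include <cstdio>

// Submit to all queues, wait for all of them, and average the wall-clock
// time over 100 frames. This includes API and submission overhead.
void MeasureFrame(const VkQueue* queues, const VkSubmitInfo* submits, uint32_t queueCount)
{
    static double accumulatedMs = 0.0;
    static int frameCount = 0;

    DWORD t0 = timeGetTime();
    for (uint32_t i = 0; i < queueCount; ++i)
        vkQueueSubmit(queues[i], 1, &submits[i], VK_NULL_HANDLE);
    for (uint32_t i = 0; i < queueCount; ++i)
        vkQueueWaitIdle(queues[i]);
    DWORD t1 = timeGetTime();

    accumulatedMs += double(t1 - t0);
    if (++frameCount == 100)
        printf("average: %f ms\n", accumulatedMs / frameCount);
}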

 

Maybe I should increase the workload, but the whole point is small workloads, and 100 dispatches is already more than I'll use in practice.

The reason behind my small / zero workload problem is the dependencies from processing a tree level by level. (But the same issue arises with pretty much anything beyond per-pixel brute force.)

I use prebuilt command buffers with one indirect dispatch per tree level.
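Roughly, the recording looks like this (the buffer and level count are placeholders; the barrier masks are my assumption, since the next level consumes both the data and the indirect arguments written by the previous level):

#include <vulkan/vulkan.h>

// One indirect dispatch per tree level; each level's workgroup count is
// written by the previous level into an array of VkDispatchIndirectCommand.
void RecordTreeLevels(VkCommandBuffer cmd, VkBuffer indirectArgs, uint32_t levelCount)
{
    VkMemoryBarrier barrier = {};
    barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;

    for (uint32_t level = 0; level < levelCount; ++level)
    {
        vkCmdDispatchIndirect(cmd, indirectArgs, level * sizeof(VkDispatchIndirectCommand));

        // The next level reads the data and the indirect arguments of this level.
        vkCmdPipelineBarrier(cmd,
                             VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                             VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT | VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                             0, 1, &barrier, 0, nullptr, 0, nullptr);
    }
}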

An alternative would be to pop the work from a queue and busy-wait until the higher tree level has been processed: one dispatch for the whole tree, but busy waiting on the GPU? Sounds very bad to me.

 

 

EDIT:

I get reasonable timings after making the dispatches indirect.

It may even be illegal to dispatch zero work directly, but that's what I've done for the numbers above.

Edited by JoeJ


I think you should actually use this:

 

https://www.khronos.org/registry/vulkan/specs/1.0/man/html/vkCmdWriteTimestamp.html

 

In any API, when you want to time GPU work you need to use timer queries (basically request the device to capture a timestamp at some point of execution into some sort of buffer) and read them back afterwards; otherwise the timing result will be wrong and imprecise. So basically what you should do is: before adding your execution commands into the command buffer, add a timestamp write command, and do the same after you add the commands. You can then read those values on the host using vkGetQueryPoolResults, or copy them into a VkBuffer using vkCmdCopyQueryPoolResults.
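A minimal sketch of that approach (the handles are placeholders; timestampPeriod is VkPhysicalDeviceLimits::timestampPeriod, the number of nanoseconds per timestamp tick):

#include <vulkan/vulkan.h>

// Create a pool with two timestamp slots.
VkQueryPool CreateTimestampPool(VkDevice device)
{
    VkQueryPoolCreateInfo info = {};
    info.sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;
    info.queryType = VK_QUERY_TYPE_TIMESTAMP;
    info.queryCount = 2;
    VkQueryPool pool = VK_NULL_HANDLE;
    vkCreateQueryPool(device, &info, nullptr, &pool);
    return pool;
}

// While recording: reset the queries and bracket the work with two timestamps.
void RecordTimestamps(VkCommandBuffer cmd, VkQueryPool pool)
{
    vkCmdResetQueryPool(cmd, pool, 0, 2);
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, pool, 0);
    // ... record the commands to be measured here ...
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, pool, 1);
}

// After the submit has finished: read both values back and convert ticks to milliseconds.
double ReadElapsedMs(VkDevice device, VkQueryPool pool, double timestampPeriod)
{
    uint64_t ts[2] = {};
    vkGetQueryPoolResults(device, pool, 0, 2, sizeof(ts), ts, sizeof(uint64_t),
                          VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT);
    return double(ts[1] - ts[0]) * timestampPeriod * 1e-6;
}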

 

Note, there is some further description directly in the spec:

 

https://www.khronos.org/registry/vulkan/specs/1.0/xhtml/vkspec.html#queries-timestamps 

Edited by Vilem Otte


Yep, seems it was an issue of measuring time and doing pointless tests.

I'm too lazy to mess around with QueryPerformanceCounter right now. Summing up hundreds of 1's and 0's and dividing by the frame count should give the same result - it just takes longer.

 

But enabling my GPU timestamp profiler (which already uses the thing Vilem mentioned) suddenly seems to work properly.

Don't ask me why. I definitely had issues before where async took longer than sync, and disabling timestamps fixed it.

 

 

 

Repeating the first test, I now get:

 

 

3 queues async: 0.71ms from my timeGetTime approach, including overhead (timestamps are expensive, so it's higher now)

 

And here are the timings for each queue, from GPU timestamps at the start and end of each queue:

queue0, 1, 2:   0.42ms,  0.44ms,  0.43ms

 

And the difference between the lowest and highest timestamp is 0.48ms.

 

 

 

1 queue, no async:

 

timeGetTime: 1.42 ms

queue from GPU: 1.30 ms

 

 

 

I don't dare repeat the meaningless tests, but let's change the dispatch counts from 100,100,100 to 200,100,50 and focus on GPU timestamps only:

1q.: 1.49ms

3q.: 0.86ms (0.83, 0.44, 0.23)

 

Now, additionally change the dispatch parameters from (1,1,1) to (20,1,1) wavefronts:

1q.: 1.38ms

3q.: 0.80ms (0.79, 0.39, 0.20)

 

Confused again that 3x20 parallel wavefronts are even faster than 3x1, but besides that the numbers look very good :)

Note that the single queue takes only 2 timestamps but 3 queues take 6, so there is some additional overhead.

Seems very close to optimal. Nice :)


One more thing I've found out: atomics on global memory do not work across different queues.

Having each workgroup increment the same value, I get fluctuating numbers between 4000 and 5000. The correct value would be 6000.

Seems each queue gets its own cache, even if all of them operate on the same memory buffer.

Maybe this is the reason I still can't get an advantage from async compute beyond synthetic tests...

Edited by JoeJ


I found the reason for my problem. To go async with Vulkan you need to divide your command buffer into multiple command buffers to make synchronization with semaphores possible (there is no other way to sync two command buffers - or am I wrong?).

I have a division like this:

 

A (0.1ms - 34 invocations, mostly zero or tiny workloads)

B (0.5ms - 16 invocations, starting with tiny, ending with heavy workloads)

C (0.5ms - final work, at this point i need results from both A and B)

 

So I can do A and B simultaneously. My goal is to hide the runtime of A behind B, and this totally works.

 

Option 1:

queue1: process A

queue2: process B, wait on A, process C

 

Option 2:

queue1: process A

queue2: process B

queue3: wait on A and B, process C
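Roughly, Option 1 maps to submits like this (handles and names are placeholders; A signals a semaphore on queue1, and the batch containing C waits for it on queue2):

#include <vulkan/vulkan.h>

void SubmitOption1(VkQueue queue1, VkQueue queue2,
                   VkCommandBuffer cmdA, VkCommandBuffer cmdB, VkCommandBuffer cmdC,
                   VkSemaphore semA)
{
    // Queue 1: submit A and signal semA when it completes.
    VkSubmitInfo submitA = {};
    submitA.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitA.commandBufferCount = 1;
    submitA.pCommandBuffers = &cmdA;
    submitA.signalSemaphoreCount = 1;
    submitA.pSignalSemaphores = &semA;
    vkQueueSubmit(queue1, 1, &submitA, VK_NULL_HANDLE);

    // Queue 2: B has no dependency, C waits on semA before its compute work starts.
    VkSubmitInfo submitB = {};
    submitB.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitB.commandBufferCount = 1;
    submitB.pCommandBuffers = &cmdB;

    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
    VkSubmitInfo submitC = {};
    submitC.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitC.waitSemaphoreCount = 1;
    submitC.pWaitSemaphores = &semA;
    submitC.pWaitDstStageMask = &waitStage;
    submitC.commandBufferCount = 1;
    submitC.pCommandBuffers = &cmdC;

    VkSubmitInfo batches[] = { submitB, submitC };
    vkQueueSubmit(queue2, 2, batches, VK_NULL_HANDLE);
}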

 

And here is the problem:

No matter which option I use, after successfully doing A and B in the desired time of 0.5ms, the GPU does nothing for about 0.15ms, and only after this gap does it start processing C.

 

0.15ms - that's a lot. Do you think that indicates a driver issue?

Do you see anything else I could try?

 

I may prepare a small project for AMD to show them...

 

 

EDIT:

 

Maybe the timestamp measurements cause the gap.

Looking at the CPU timer, the difference between timestamps on and off is 0.3 ms.

 

Unfortunately there is no way to be sure.

Hopefully AMD's upcoming profiling tool will clarify things...

Edited by JoeJ


Ok, assuming both queues start at the same time (the difference is usually about 0.01ms) and using only 2 timestamps instead of 6, I finally get a win of 0.05ms.


You wrote: "I found the reason for my problem. To go async with Vulkan you need to divide your command buffer into multiple command buffers to make synchronization with semaphores possible."

Is that the cause of this?

"One more thing I've found out: atomics on global memory do not work across different queues. Having each workgroup increment the same value, I get fluctuating numbers between 4000 and 5000. The correct value would be 6000. Seems each queue gets its own cache, even if all of them operate on the same memory buffer."

If each queue really gets its own cache, my assumptions about the GPU cache strategy are totally wrong, and I'll have to rewrite a lot of my code..... :(


 

I'm not sure you're drawing the right conclusions from this, or whether it affects your decisions at all. What's your example?

The atomic inconsistency hints that atomics are implemented in the cache (which is good - otherwise they would probably be very slow), but I don't know whether this inconsistency is hardware or API related, or how the API specs handle it. It could be different between VK and DX12.

But anyway, it should not lead to serious limitations in practice: either put all dispatches that do atomics on the same memory in one queue, or, if you really want to use async compute, synchronize the queues to make the results visible.
