Vulkan: Confusing performance with async compute

I started doing some testing on this with compute workloads only, so not the typical 'do compute while rendering shadow maps' scenario.

The motivation is to keep the GPU busy in cases where a dispatch has little or zero work, which is often unavoidable.

 

I use Vulkan on a Fury X, but I'm interested in any experience / opinions, so I'm posting it here.

 

The test shader uses 64 threads. Each thread reads one float, then builds a prefix sum in LDS and writes the result back.
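
For reference, here's roughly what such a shader could look like (my reconstruction, not the exact test code; GLSL embedded as a C++ string, names made up):

```cpp
// Sketch of a 64-thread prefix sum shader: each thread loads one float,
// a Hillis-Steele inclusive scan runs in LDS (shared memory), and the
// result is written back.
const char* kPrefixSumGlsl = R"(
#version 450
layout(local_size_x = 64) in;
layout(std430, binding = 0) buffer Data { float values[]; };

shared float lds[64];

void main()
{
    uint i = gl_LocalInvocationID.x;
    lds[i] = values[i];                        // each thread reads one float
    barrier();
    for (uint offset = 1; offset < 64; offset *= 2)
    {
        float v = (i >= offset) ? lds[i - offset] : 0.0;
        barrier();
        lds[i] += v;                           // one scan step
        barrier();
    }
    values[i] = lds[i];                        // write the prefix sum back
}
)";
```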

On the API side I do 100 dispatches, each with only one thread group (so one wavefront), with a memory barrier between dispatches to simulate dependencies.

I do all of this 3 times on 3 different memory buffers, so the three passes could run async.
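
The recording of one pass looks roughly like this (a sketch; pipeline and descriptor set binding are omitted, and 'cmd' is assumed to be a command buffer in the recording state):

```cpp
#include <vulkan/vulkan.h>

// Record 100 one-workgroup dispatches, chained by memory barriers to
// simulate dependencies. This is recorded three times, once per buffer,
// so the three command buffers can go to three different queues.
void recordTestPass(VkCommandBuffer cmd)
{
    VkMemoryBarrier barrier = {};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT;

    for (int i = 0; i < 100; ++i)
    {
        vkCmdDispatch(cmd, 1, 1, 1);    // a single workgroup = one wavefront
        vkCmdPipelineBarrier(cmd,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
            0, 1, &barrier, 0, nullptr, 0, nullptr);
    }
}
```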

 

 

Here are the results:

 

All in a single queue takes 1.41 ms

Using 3 queues I would expect one third of that, assuming my GPU can process 3 wavefronts in parallel :) What I really get is 0.64 ms, which is not that bad.

 

Now I remove the memory barriers. This gives wrong results, but shows how costly the barriers are:

1 q.: 0.12 ms

3 q.: 0.16 ms   ...now this is starting to look bad

 

The next test keeps the barriers but uses zero-work dispatches, so it does nothing:

1 q.: 0.43 ms   wtf?

3 q.: 0.50 ms   I saw it coming

 

Next, zero-work dispatches without barriers, doing absolutely nothing:

1 q.: 0.12 ms

3 q.: 0.18 ms

 

Next, empty command buffers without any dispatch, for reference:

1 q.: 0 ms         What's that? Driver optimizations?

3 q.: 0.14 ms    Should we subtract this from all the async timings? Makes sense...

 

 

Some confusing numbers. (OK - mostly in the impractical cases, but worth mentioning.)

 

I should probably explain how I measure time:

Reading a GPU timestamp at the start and end of each queue does not work here - the timestamps act like barriers and seem to disturb async compute.

So I read the time with Windows timeGetTime(), call vkQueueSubmit() for each queue, then vkQueueWaitIdle() for each queue, read the time again, accumulate over 100 frames and print the average.

So this includes API overhead.
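
In code, the measurement is roughly this (a sketch; the queues and per-queue submit infos are assumed to be set up already):

```cpp
#include <windows.h>            // timeGetTime(); link against winmm.lib
#include <vulkan/vulkan.h>

// timeGetTime() only has about 1 ms resolution, which is why the per-frame
// results (mostly 1s and 0s) are accumulated over 100 frames and averaged.
double measureSubmitMs(VkQueue* queues, VkSubmitInfo* submits, int numQueues)
{
    DWORD t0 = timeGetTime();
    for (int q = 0; q < numQueues; ++q)
        vkQueueSubmit(queues[q], 1, &submits[q], VK_NULL_HANDLE);
    for (int q = 0; q < numQueues; ++q)
        vkQueueWaitIdle(queues[q]);
    return double(timeGetTime() - t0);  // accumulate this over 100 frames
}
```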

 

Maybe I should increase the workload, but this is all about small workloads, and 100 dispatches is already more than I'll use in practice.

The reason behind my small / zero workload problem is the dependency between tree levels when processing a tree level by level. (But the same issue arises with pretty much anything beyond per-pixel brute force.)

I use prebuilt command buffers with one indirect dispatch per tree level.
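
Roughly like this (a sketch with assumed names; 'argsBuffer' holds one VkDispatchIndirectCommand per level, written on the GPU by the previous level):

```cpp
#include <vulkan/vulkan.h>

// One indirect dispatch per tree level. The barrier makes each level's
// results - and the next level's dispatch arguments - visible before the
// next indirect dispatch reads them.
void recordTreeLevels(VkCommandBuffer cmd, VkBuffer argsBuffer, uint32_t numLevels)
{
    VkMemoryBarrier barrier = {};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT |
                            VK_ACCESS_INDIRECT_COMMAND_READ_BIT;

    for (uint32_t level = 0; level < numLevels; ++level)
    {
        vkCmdDispatchIndirect(cmd, argsBuffer,
                              level * sizeof(VkDispatchIndirectCommand));
        vkCmdPipelineBarrier(cmd,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
            VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT | VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
            0, 1, &barrier, 0, nullptr, 0, nullptr);
    }
}
```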

An alternative would be to pop work from a queue and busy-wait until the higher tree level has been processed: one dispatch for the whole tree, but busy waiting on the GPU? That sounds very bad to me.

 

 

EDIT:

I get reasonable timings after making the dispatches indirect.

It may even be illegal to dispatch zero work directly, but that's what I did for the numbers above.

Edited by JoeJ

I think you should actually use this:

 

https://www.khronos.org/registry/vulkan/specs/1.0/man/html/vkCmdWriteTimestamp.html

 

In any API, when you want to time GPU work you need to use timer queries (basically you request that the device capture a timestamp at some point of execution into a sort of buffer) and read them back - otherwise the timing results will be wrong and imprecise. So basically, before adding your execution commands into the command buffer, add a timestamp write command, and do the same after you add your commands. You can then read those values on the host using vkGetQueryPoolResults, or copy them into a VkBuffer using vkCmdCopyQueryPoolResults.
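
In code this could look roughly like the following (a sketch; it assumes a query pool created with VK_QUERY_TYPE_TIMESTAMP and two queries, and the function names besides the Vulkan calls are made up):

```cpp
#include <vulkan/vulkan.h>

// Write a timestamp before and after the workload.
void recordTimedWork(VkCommandBuffer cmd, VkQueryPool pool)
{
    vkCmdResetQueryPool(cmd, pool, 0, 2);
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,    pool, 0);
    // ... record the actual dispatches here ...
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, pool, 1);
}

// Read both values back on the host and convert ticks to milliseconds.
// 'timestampPeriod' is VkPhysicalDeviceLimits::timestampPeriod (ns per tick).
double readElapsedMs(VkDevice device, VkQueryPool pool, float timestampPeriod)
{
    uint64_t ticks[2] = {};
    vkGetQueryPoolResults(device, pool, 0, 2, sizeof(ticks), ticks,
                          sizeof(uint64_t),
                          VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT);
    return double(ticks[1] - ticks[0]) * timestampPeriod * 1e-6;
}
```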

 

Note that there is some further description directly in the spec:

 

https://www.khronos.org/registry/vulkan/specs/1.0/xhtml/vkspec.html#queries-timestamps 

Edited by Vilem Otte

Yep, seems it was an issue of measuring time badly and doing pointless tests.

I'm too lazy to mess around with QueryPerformanceCounter right now. Summing up hundreds of 1s and 0s and dividing by the frame count should give the same result - it just takes longer.

 

But enabling my GPU timestamp profiler (which already uses the thing Vilem mentioned) suddenly seems to work properly.

Don't ask me why. I definitely had issues before where async took longer than sync, and disabling the timestamps fixed it.

 

 

 

Repeating the first test, I now get:

 

 

3 queues async: 0.71 ms from my timeGetTime approach, including overhead (timestamps are expensive, so it's more now).

 

And here are the timings for each queue, from GPU timestamps at the start and end of each queue:

queue 0, 1, 2:   0.42 ms,  0.44 ms,  0.43 ms

 

And the difference between the lowest and highest timestamp is 0.48 ms.

 

 

 

1 queue, no async:

 

timeGetTime: 1.42 ms

queue from GPU: 1.30 ms

 

 

 

I don't dare repeat the meaningless tests, but let's change the dispatch counts from 100,100,100 to 200,100,50 and focus on the GPU timestamps only:

1 q.: 1.49 ms

3 q.: 0.86 ms (0.83, 0.44, 0.23)

 

Now, additionally change the dispatch parameters from (1,1,1) to (20,1,1), so 20 wavefronts per dispatch:

1 q.: 1.38 ms

3 q.: 0.80 ms (0.79, 0.39, 0.20)

 

I'm confused again that 3x20 parallel wavefronts are even faster than 3x1, but besides that the numbers look very good :)

Note that the single queue takes only 2 timestamps while 3 queues take 6, so there is some additional overhead.

Seems very close to optimal. Nice :)

One more thing I've found out: atomics on global memory do not work across different queues.

With each workgroup incrementing the same value, I get fluctuating numbers between 4000 and 5000. The correct value would be 6000.

It seems each queue gets its own cache, even if all of them operate on the same memory buffer.
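
The test boils down to something like this (GLSL as a C++ string; the binding and the exact workgroup counts are assumptions), submitted from three different queues:

```cpp
// One global counter, incremented once per workgroup. With 6000 workgroups
// split across three queues the result should be 6000 - it isn't.
const char* kCounterGlsl = R"(
#version 450
layout(local_size_x = 64) in;
layout(std430, binding = 0) buffer Counter { uint total; };

void main()
{
    if (gl_LocalInvocationID.x == 0)
        atomicAdd(total, 1u);    // one increment per workgroup
}
)";
```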

Maybe this is the reason I still can't get an advantage from async compute beyond synthetic tests...

Edited by JoeJ

I found the reason for my problem. To go async with Vulkan you need to divide your command buffer into multiple command buffers, so that synchronization by semaphores becomes possible. (There is no other way to sync two command buffers - or am I wrong?)

I have a division like this:

 

A (0.1 ms - 34 invocations, mostly zero or tiny workloads)

B (0.5 ms - 16 invocations, starting with tiny and ending with heavy workloads)

C (0.5 ms - the final work; at this point I need the results from both A and B)

 

So I can do A and B simultaneously. My goal is to hide the runtime of A behind B, and this totally works.

 

Option 1:

queue1: process A

queue2: process B, wait on A, process C

 

Option 2 (sketched in code below):

queue1: process A

queue2: process B

queue3: wait on A and B, process C
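
Option 2's submits would look roughly like this (a sketch; semaphore and command buffer creation are assumed, and all names are made up):

```cpp
#include <vulkan/vulkan.h>

// A signals semA, B signals semB, and C waits on both before executing.
void submitOption2(VkQueue q1, VkQueue q2, VkQueue q3,
                   VkCommandBuffer cbA, VkCommandBuffer cbB, VkCommandBuffer cbC,
                   VkSemaphore semA, VkSemaphore semB)
{
    VkSubmitInfo a = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
    a.commandBufferCount   = 1; a.pCommandBuffers   = &cbA;
    a.signalSemaphoreCount = 1; a.pSignalSemaphores = &semA;
    vkQueueSubmit(q1, 1, &a, VK_NULL_HANDLE);

    VkSubmitInfo b = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
    b.commandBufferCount   = 1; b.pCommandBuffers   = &cbB;
    b.signalSemaphoreCount = 1; b.pSignalSemaphores = &semB;
    vkQueueSubmit(q2, 1, &b, VK_NULL_HANDLE);

    VkSemaphore          waits[2]  = { semA, semB };
    VkPipelineStageFlags stages[2] = { VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                                       VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT };
    VkSubmitInfo c = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
    c.waitSemaphoreCount = 2; c.pWaitSemaphores = waits; c.pWaitDstStageMask = stages;
    c.commandBufferCount = 1; c.pCommandBuffers = &cbC;
    vkQueueSubmit(q3, 1, &c, VK_NULL_HANDLE);
}
```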

 

And here is the problem:

No matter which option I use, after successfully doing A and B in the desired 0.5 ms, the GPU does nothing for about 0.15 ms, and only after this gap does it start processing C.

 

0.15 ms - that's a lot. Do you think this indicates a driver issue?

Do you see anything else I could try?

 

I may prepare a small project for AMD to show them...

 

 

EDIT:

 

Maybe the timestamp measurements cause the gap.

Looking at the CPU timer, if I turn timestamps on / off the difference is 0.3 ms.

 

Unfortunately there is no way to be sure.

Hopefully AMD's upcoming profiling tool will clarify...

Edited by JoeJ

Quoting JoeJ:

"I found the reason for my problem. To go async with Vulkan you need to divide your command buffer into multiple command buffers, so that synchronization by semaphores becomes possible."

So is that the cause of what you get here?

"One more thing I've found out: atomics on global memory do not work across different queues. With each workgroup incrementing the same value, I get fluctuating numbers between 4000 and 5000. The correct value would be 6000. It seems each queue gets its own cache, even if all of them operate on the same memory buffer."

If each queue really gets its own cache, my assumption about the GPU cache strategy is totally wrong, and I'll have to rewrite a lot of my code..... :(

I'm not sure you're drawing the right conclusions from this, or that it affects your decisions at all. What's your use case?

The atomic inconsistency hints that atomics are implemented on the cache (which is good - otherwise they would probably be very slow), but I don't know whether this inconsistency is hardware or API related, or how the API specs handle it. It could be different for VK / DX12.

But anyway, it should not lead to serious limitations in practice: either put all dispatches that do atomics on the same memory into one queue, or, if you really want to use async compute, sync the queues to make the results visible.

