
Vulkan: How does compute shader code size affect performance?


Recommended Posts

I have just added some code to a shader to distribute work to idle threads. It should be a win, but it's a slowdown.

When I put a condition around it so that it executes in only 1% of all workgroups, I still get the same slowdown.
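
(Aside on measurement, since the numbers later in this thread are GPU times: here is a minimal sketch of timing one dispatch with timestamp queries. The names queryPool, cmd, groupsX, and timestampPeriod are placeholders; it assumes the queue reports timestampValidBits > 0, and timestampPeriod comes from VkPhysicalDeviceLimits, in nanoseconds per tick.)

    // While recording: bracket the dispatch with two timestamps.
    vkCmdResetQueryPool(cmd, queryPool, 0, 2);
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, queryPool, 0);
    vkCmdDispatch(cmd, groupsX, 1, 1);
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, queryPool, 1);

    // After the submit completes: read the ticks back and convert to milliseconds.
    uint64_t ts[2];
    vkGetQueryPoolResults(device, queryPool, 0, 2, sizeof(ts), ts, sizeof(uint64_t),
                          VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT);
    double ms = (double)(ts[1] - ts[0]) * timestampPeriod * 1e-6;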

 

I see two options:

 

The additional code increases the number of registers used and so reduces occupancy.

Unfortunately I have no way to detect this with Vulkan, but I think it's very unlikely.

Please let me know if you know a tool to get this information (a sketch of one possible query follows below).

 

Or the shader simply became too large.

I think this is what's happening, and I also think that's the reason why I often get better performance when I make it impossible for the compiler to unroll loops.

But I can't find any documentation on code size penalties (I'm mainly interested in GCN for now).
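
For anyone with the same register-count question: if the driver exposes the VK_KHR_pipeline_executable_properties extension (an assumption; support varies, and the entry points may need to be fetched via vkGetDeviceProcAddr), per-shader statistics such as register counts can be queried. A minimal sketch, assuming the pipeline was created with VK_PIPELINE_CREATE_CAPTURE_STATISTICS_BIT_KHR:

    #include <stdio.h>
    #include <string.h>
    #include <vulkan/vulkan.h>

    void PrintPipelineStats(VkDevice device, VkPipeline pipeline)
    {
        VkPipelineInfoKHR pipelineInfo = { VK_STRUCTURE_TYPE_PIPELINE_INFO_KHR };
        pipelineInfo.pipeline = pipeline;

        // One "executable" per hardware shader; a compute pipeline has one.
        uint32_t execCount = 0;
        vkGetPipelineExecutablePropertiesKHR(device, &pipelineInfo, &execCount, NULL);

        for (uint32_t e = 0; e < execCount; ++e)
        {
            VkPipelineExecutableInfoKHR execInfo = { VK_STRUCTURE_TYPE_PIPELINE_EXECUTABLE_INFO_KHR };
            execInfo.pipeline = pipeline;
            execInfo.executableIndex = e;

            uint32_t statCount = 0;
            vkGetPipelineExecutableStatisticsKHR(device, &execInfo, &statCount, NULL);
            if (statCount > 32) statCount = 32; // plenty for current drivers

            VkPipelineExecutableStatisticKHR stats[32];
            memset(stats, 0, sizeof(stats));
            for (uint32_t s = 0; s < statCount; ++s)
                stats[s].sType = VK_STRUCTURE_TYPE_PIPELINE_EXECUTABLE_STATISTIC_KHR;
            vkGetPipelineExecutableStatisticsKHR(device, &execInfo, &statCount, stats);

            // The counters are implementation-defined; AMD's drivers report
            // register usage here, which is exactly the occupancy question.
            for (uint32_t s = 0; s < statCount; ++s)
                if (stats[s].format == VK_PIPELINE_EXECUTABLE_STATISTIC_FORMAT_UINT64_KHR)
                    printf("%s: %llu\n", stats[s].name, (unsigned long long)stats[s].value.u64);
        }
    }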

 

 

So, have you experienced similar issues?


It's a bit hard to tell from your general description... Can you post some code of the theoretical-but-not-practical optimization for us to deconstruct and theory-craft about?


OK, so this is the code I've added that causes my issue.

Similar code gave me a 20% speedup in an older version of my shader.

This code is at a point where complex math is done and VGPR usage should be low (my guess: 10), and I don't think it increases register usage at all if the compiler is clever enough.

Simply removing it would be good and fast enough, but... I hate idle threads :)

I will try to replace a complex math code block with a lookup. I bet the work distribution increases performance after that...

 

EDIT: The entire shader is about 700 lines.

// Goal: distribute work to idle threads; also split large workloads first to reduce work divergence.

        // _counter is a local ("shared") LDS uint initialized to zero
        // packed is a VGPR containing the work description and other irrelevant bits
        // hasWork is a boolean in a VGPR
        // lID is the current thread ID

        uint worklessThreadSlot = 0x10000; // large number to save a branch later
        if (!hasWork) worklessThreadSlot = atomic_add(ADRS _counter, 1); // a thread with no work gets an index
        else packed |= lID << 8; // this link to the original work-spending thread is copied along for the later data transfer
        BARRIER_LOCAL

        uint availableCount = _counter; // idle thread count
        BARRIER_LOCAL

        if (availableCount > (WG_WIDTH * 13 / 14)) // the condition I added so the block executes only rarely, to test how that affects performance
        {
            _counter = 0;
            BARRIER_LOCAL

            uint maxWork = large constant;
            for (;;) // top-down method: first split threads holding large amounts of work, then shrink the maxWork threshold to distribute smaller workloads
            {
                uint firstWorkReceiver = _counter;
                BARRIER_LOCAL

                uint work = packed;

                bool split = (work & 0xFFFF0000) > maxWork;
                if (split)
                {
                    uint newWorker = atomic_add(_counter, 1);
                    if (newWorker < availableCount)
                    {
                        packed = modify to do only half of the work;
                        _exchangeLDS[newWorker] = move the other half of the work to LDS so another thread can grab it; // all this is 10 lines of simple bit manipulation code
                    }
                }

                BARRIER_LOCAL

                // update the register of the work-receiving thread

                bool isNewWorker = (worklessThreadSlot >= firstWorkReceiver
                                 && worklessThreadSlot < min(_counter, availableCount));
                if (isNewWorker)
                {
                    packed = _exchangeLDS[worklessThreadSlot];
                    hasWork = true; // now this thread knows about its received work and is ready to subdivide again
                }

                maxWork >>= 1; // shrink the threshold

                if ((maxWork <= small constant) ||      // subdivision fine enough
                    (_counter >= availableCount) ||     // out of idle threads
                    (_counter == firstWorkReceiver))    // nothing found to subdivide
                    break;
            }
            BARRIER_LOCAL

#if 1 // distribution completed; now copy some other register data through LDS. Cost: 0.05 ms (the entire code costs 0.1 ms)

            bool isWorkReceiver = (worklessThreadSlot < min(_counter, availableCount));
            uint srcIndex = (packed >> 8) & 0xFF;

            // repeat this copy operation for 3 VGPRs (2 x vec4 and 1 x uint)
            {
                _exchangeLDS[lID] = original thread VGPR data; // in total I copy 2 x float4 + 1 x uint this way
                BARRIER_LOCAL
                if (isWorkReceiver) receiving VGPR data = _exchangeLDS[srcIndex];
            }
#endif
        }

        // continue doing the work...

Independent of that code block, I've often had the feeling that adding code causes slowdowns.

With OpenCL and CodeXL I saw nothing bad like a register / LDS / bandwidth increase or an occupancy decrease. It's as if adding one more line of code drops you down a performance tier.

But I'm just guessing and would like to know for sure.

Edited by JoeJ


I'd inspect the hardware side first, in terms of cache and memory.

Does it load from cache? Is the memory aligned properly?

I don't think you run out of code segments or anything like that; I rather think it's the way the conditional structures are handled.

Maybe some conditional optimization causes a bug.

Maybe the specific driver causes these slowdowns. I'm not really a Vulkan expert, so I can't tell you for sure.

Try to separate the code and see what gets you the most slowdown.

Personally, I'm worried about this kind of code:

for (;;)


The involved memory is LDS only, so no cache / alignment issues.

The for loop executes at most 4 times; I use the (;;) form to prevent unrolling here, since there is no working #pragma for Vulkan yet.

(I keep such things configurable with #ifdefs. E.g., in my old shader, unrolling was a win for 256-thread workgroups but a loss for 128.)
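
To illustrate the kind of #ifdef toggle I mean (a sketch only; the names UNROLL_DISTRIBUTION, DistributionPass, and DistributionDone are made up for illustration): a counted loop the compiler is free to unroll versus a data-dependent exit it cannot unroll:

    #if UNROLL_DISTRIBUTION // toggled per workgroup size, as described above
        // Fixed trip count: the compiler may fully unroll this,
        // trading code size for fewer branches.
        for (uint pass = 0; pass < 4; ++pass)
            DistributionPass();
    #else
        // The exit depends on values computed inside the loop (like the
        // three break conditions in the snippet above), so the compiler
        // cannot unroll it, which keeps code size down.
        for (;;)
        {
            DistributionPass();
            if (DistributionDone())
                break;
        }
    #endif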

 

Try to separate the code and see what gets you the most slowdown.

 

I did a comparison with the older version of my shader, where the code section was a win. Surprise: it takes the same time there.

Assumption: the old shader has fewer idle threads than the new shader, so work distribution should be an even bigger win for the new shader. Yummy...

Reality: because the work processing has been optimized well, making idle threads busy is not worth the effort anymore? Really?

No, it can't be that simple, because that does not explain the slowdown even when I put the block behind a condition that makes sure it is executed absolutely never.

 

Arrrgh. Please, AMD, give us a tool to inspect Vulkan register usage and occupancy... this guessing drives me crazy.

So I'll continue in OpenCL, reduce register usage there, and hope Vulkan will benefit from those changes. Perfect workflow... :|
