
Vulkan: How does compute shader code size affect performance?


I have just added some code to a shader to distribute work to idle threads. It should be a win, but it causes a slowdown.

Even when I put a condition around it so that it executes in only 1% of all workgroups, I still get the same slowdown.


I see two possible explanations:


The additional code increases the register count and so reduces occupancy.

Unfortunately I have no way to detect this with Vulkan, but it seems very unlikely.

Please let me know if you know a tool to get this information.
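For what it's worth, the way extra registers eat occupancy on GCN can be sketched with a simplified model (my assumptions, pieced together from public GCN docs: a 256-entry VGPR file per SIMD lane, at most 10 waves per SIMD, 4-register allocation granularity; SGPR and LDS limits ignored):

    #include <stdio.h>

    /* Simplified, hypothetical GCN occupancy estimate from VGPR count.
     * Each SIMD lane has 256 VGPRs and runs at most 10 waves; VGPRs are
     * allocated in granules of 4. SGPR and LDS limits are ignored here. */
    static int waves_per_simd(int vgprs_per_thread)
    {
        int granule = 4;
        int alloc = ((vgprs_per_thread + granule - 1) / granule) * granule;
        int waves = 256 / alloc;
        return waves > 10 ? 10 : waves;
    }

    int main(void)
    {
        printf("%d\n", waves_per_simd(24)); /* low usage: full 10 waves */
        printf("%d\n", waves_per_simd(48)); /* 5 waves */
        printf("%d\n", waves_per_simd(84)); /* 3 waves */
        return 0;
    }

So in this model a shader sitting right at a granule boundary can lose a whole wave of occupancy from just a few extra registers.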


Or the shader simply became too large.

I think this is what's happening, and I also think it's the reason why I often get better performance when I make it impossible for the compiler to unroll loops.

But I can't find any documentation on code size penalties (I'm mainly interested in GCN for now).



So, have you experienced similar issues?


It's a bit hard to tell from your general description... Can you post some code of the theoretical-but-not-practical optimization for us to deconstruct and theory-craft about?


OK, so this is the code I've added that causes my issue.

Similar code gave me a 20% speedup in an older version of my shader.

This code sits at a point where complex math is done and VGPR usage should be low (my guess: 10), and I don't think it increases register usage at all if the compiler is clever enough.

Simply removing it would be good and fast enough, but... I hate idle threads :)

I will try to replace a complex math block with a lookup. I bet the work distribution increases performance after that...


EDIT: the entire shader is about 700 lines.

        // goal: distribute work to idle threads; also split large workloads first to reduce work divergence
        // _counter is a workgroup-local ("shared") LDS uint initialized to zero
        // packed is a VGPR containing the work description and other unrelated bits
        // hasWork is a boolean in a VGPR
        // lID is the current thread ID

        uint worklessThreadSlot = 0x10000; // large number to save a branch later
        if (!hasWork) worklessThreadSlot = atomic_add(ADRS _counter, 1); // thread with no work gets an index
        else packed |= lID << 8; // this link to the original work-spending thread is copied along for a later data transfer

        uint availableCount = _counter; // idle thread count

        if (availableCount > (WG_WIDTH * 13 / 14)) // this is the condition I've added to test how performance is affected when the block executes only rarely
            _counter = 0;

        uint maxWork = large constant;
        for (;;) // top-down method: first split threads with large amounts of work, then shrink the maxWork threshold to distribute smaller workloads
        {
            uint firstWorkReceiver = _counter;

            uint work = packed;

            bool split = (work & 0xFFFF0000) > maxWork;
            if (split)
            {
                uint newWorker = atomic_add(_counter, 1);
                if (newWorker < availableCount)
                {
                    packed = modify to do only half of the work;
                    _exchangeLDS[newWorker] = move other half of the work to LDS so another thread can grab it; // all this is ~10 lines of simple bit manipulation code
                }
            }

            // update the register of the work-receiving thread

            bool isNewWorker = (worklessThreadSlot >= firstWorkReceiver
                             && worklessThreadSlot < min(_counter, availableCount));
            if (isNewWorker)
            {
                packed = _exchangeLDS[worklessThreadSlot];
                hasWork = true; // now this thread knows about its received work and is ready to subdivide again
            }

            maxWork >>= 1; // shrink threshold

            if ((maxWork <= small constant) || // subdivision fine enough
                (_counter >= availableCount) || // out of idle threads
                (_counter == firstWorkReceiver)) // nothing found to subdivide
                break;
        }

#if 1 // distribution completed, need to copy some other register data through LDS. cost: 0.05 ms; cost of the entire code block: 0.1 ms

        bool isWorkReceiver = (worklessThreadSlot < min(_counter, availableCount));
        uint srcIndex = (packed >> 8) & 0xFF;

        // repeat a copy operation like this for 3 VGPRs (2 x vec4 and 1 x uint)
            _exchangeLDS[lID] = original thread VGPR data; // in total I copy 2 x float4 + 1 uint this way
            if (isWorkReceiver) receiving VGPR data = _exchangeLDS[srcIndex];
#endif

        // continue doing the work...
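To make the intent of the loop above easier to follow, here is a sequential CPU sketch of the same top-down idea (names and the workgroup size are mine; the real version runs in parallel and uses LDS atomics instead of the inner search loop):

    #include <stdio.h>

    /* Sequential model of top-down work distribution: while a shrinking
     * threshold is above some minimum, any thread holding more work than
     * the threshold hands half of it to an idle thread. */
    #define WG_WIDTH 8

    static void distribute(unsigned work[WG_WIDTH], unsigned maxWork, unsigned minWork)
    {
        while (maxWork > minWork) {
            for (int i = 0; i < WG_WIDTH; ++i) {
                if (work[i] > maxWork) {            /* thread i holds too much */
                    for (int j = 0; j < WG_WIDTH; ++j) {
                        if (work[j] == 0) {         /* find an idle thread */
                            unsigned half = work[i] / 2;
                            work[j]  = half;        /* hand over half of it */
                            work[i] -= half;
                            break;
                        }
                    }
                }
            }
            maxWork >>= 1;                          /* shrink threshold */
        }
    }

    int main(void)
    {
        unsigned work[WG_WIDTH] = { 64, 0, 0, 0, 0, 0, 0, 0 };
        distribute(work, 32, 4);
        for (int i = 0; i < WG_WIDTH; ++i)
            printf("%u ", work[i]);  /* one big workload spread evenly: 8 each */
        printf("\n");
        return 0;
    }

Splitting the largest workloads first is what keeps the final per-thread amounts close to each other.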

Independent of that code block, I've often had the feeling that adding code caused slowdowns.

With OpenCL and CodeXL I saw nothing bad like a register / LDS / bandwidth increase or an occupancy decrease - it's just like: add one more line of code and you drop a performance tier.

But I'm just guessing and would like to know for sure.

Edited by JoeJ


I'd inspect the hardware first in terms of cache and memory.

Does it load from cache? Is the memory aligned properly?

I don't think you run out of code segments or anything like that; I think it's the way it loads conditional structures.

Maybe some conditional optimization causes a bug.

Maybe the specific driver causes these slowdowns - I'm not really a Vulkan expert, so I can't tell you for sure.


Try to separate the code and look at what gets you the biggest slowdown.

Personally, I'm worried by this kind of code:

for (;;)


The involved memory is LDS only, so no cache / alignment issues.


The for loop executes at most 4 times; I use the (;;) form to prevent unrolling here. There is no working #pragma for Vulkan yet.

(I keep such things configurable with #ifdefs - e.g. in my old shader, unrolling was a win for 256-thread workgroups but a loss for 128.)
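The bound the loop relies on can be sketched in C (constants below are made up for illustration, not the shader's real values):

    #include <stdio.h>

    /* Sketch of the anti-unroll pattern: for(;;) hides the trip count behind
     * a runtime break, so the compiler has no static bound to unroll against. */
    static int shrink_iterations(unsigned maxWork, unsigned smallConstant)
    {
        int iterations = 0;
        for (;;) {                        /* open-ended: no static trip count */
            ++iterations;
            maxWork >>= 1;                /* shrink threshold, as in the shader */
            if (maxWork <= smallConstant) /* runtime exit condition */
                break;
        }
        return iterations;
    }

    int main(void)
    {
        printf("%d\n", shrink_iterations(1u << 16, 1u << 12)); /* prints 4 */
        return 0;
    }

With these constants the loop runs exactly 4 times, matching the "executes at most 4 times" above.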


Try to separate the code and look at what gets you the biggest slowdown.


I did a comparison with the older version of my shader, where the code section was a win. Surprise: it takes the same time there.

Assumption: the old shader has fewer idle threads than the new shader, so work distribution should be an even bigger win for the new shader. Yummy...

Reality: because work processing has been optimized well, making idle threads busy is not worth the effort anymore? Really?


No - it can't be that simple, because this does not explain the slowdown even when I put it behind a condition that makes sure it is executed absolutely never.


Arrrgh - please AMD, give us a tool to inspect Vulkan register usage and occupancy... this guessing drives me crazy.

So I'll continue in OpenCL, reduce register usage there, and hope Vulkan will benefit from those changes. Perfect workflow...  :|
