Very informative, thanks.
I didn't know that registers are not allocated dynamically per subroutine. Is this likely to stay this way for a long time?
What do you think about grouping subroutines into several shaders based on complexity, then? Couldn't that be a good compromise?
I don't see it changing any time soon; the registers are divided up between instances when a program starts so the hardware knows how many it can keep in flight at once. If you make that dynamic in some way you increase the cost of execution, because we can't grab all the resources up front; during program flow we have to try to grab more, and that can fail if the resources aren't there.
Let's say the shader byte stream, and thus the hardware, knew about subroutine points; then it might work like this.
Say you had a shader made up of a main and 2 subroutines:
main - 5 GPR
func1 - 10 GPR
func2 - 20 GPR
So when execution starts the hardware gets told we need 5 GPR, so it figures out how many instances it can launch and off we go.
Then you pick a subroutine, but hold on; where do we get our GPRs from? We allocated them all up front to spawn our initial workload... bugger. Left as it is, we deadlock and the GPU locks up.
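To put rough numbers on that, here's a quick Python sketch. The register file size of 240 GPRs is a figure I've made up so the division comes out clean (real hardware differs), and the names are just for illustration; the per-function GPR counts are the ones from the example above.

```python
# Hypothetical register file size; picked for easy arithmetic, not real hardware.
REGISTER_FILE_GPRS = 240

# GPR counts from the example above (registers needed while in each function).
GPR_NEEDS = {"main": 5, "func1": 10, "func2": 20}


def instances_in_flight(gprs_per_instance, register_file=REGISTER_FILE_GPRS):
    """How many instances fit if each one is given gprs_per_instance registers."""
    return register_file // gprs_per_instance


# The hardware is told "5 GPR" and launches as many instances as will fit.
launched = instances_in_flight(GPR_NEEDS["main"])

# Registers left over once every launched instance has its 5 GPRs.
free_after_launch = REGISTER_FILE_GPRS - launched * GPR_NEEDS["main"]

# One instance now wants to enter func1 and needs this many more registers.
extra_for_func1 = GPR_NEEDS["func1"] - GPR_NEEDS["main"]
can_enter_func1 = free_after_launch >= extra_for_func1

print(launched, free_after_launch, can_enter_func1)
```

With these made-up numbers every register is handed out at launch, so not even one instance can step into func1 without something else giving its registers back.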
So you've got two choices:
1) Attempt to stream the state out and reallocate - this would be slow, as you'd need to stream to main memory and halt all processing while that is going on, and then you'd pay the state setup cost again to reconfigure for half as many tasks running at once (for func1), or even 1/4 as many (func2). Then when the subroutines return you have to restore main's state, set up again, pull all the tasks back together and relaunch in the old configuration.
(I've probably missed some problems, as wavefronts/warps could be on the same SIMD unit, and thus sharing the register file, but executing different paths. So you run a performance risk again when something with a higher GPR count needs space but not enough of the register file is free; whole wavefronts/warps end up sleeping at that point, which could hurt performance.)
2) Stick with the current system of pre-allocating registers, so you run a lower number of instances at once but you don't need any complicated hardware logic for rescheduling workloads as the shaders progress.
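As a rough comparison of the two choices, here's a Python sketch using the main/func1/func2 numbers from earlier. The 240 GPR register file is a made-up figure, the function names are hypothetical, and I'm assuming that pre-allocation sizes every instance for the most register-hungry path in the shader.

```python
# Hypothetical register file size; picked for easy arithmetic, not real hardware.
REGISTER_FILE_GPRS = 240

# GPR counts from the example (registers needed while in each function).
GPR_NEEDS = {"main": 5, "func1": 10, "func2": 20}


def dynamic_instances(gpr_needs, register_file=REGISTER_FILE_GPRS):
    """Choice 1: instances that would fit if the hardware reconfigured per function."""
    return {name: register_file // gprs for name, gprs in gpr_needs.items()}


def preallocated_instances(gpr_needs, register_file=REGISTER_FILE_GPRS):
    """Choice 2: size every instance for the worst case, once, at launch (assumed)."""
    return register_file // max(gpr_needs.values())


per_phase = dynamic_instances(GPR_NEEDS)
fixed = preallocated_instances(GPR_NEEDS)

print(per_phase)  # instance count per function under choice 1
print(fixed)      # single instance count under choice 2
```

Under these assumptions choice 1 only beats choice 2 while instances are in main or func1, and it gives that advantage back every time it has to stream state out and reconfigure.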
In theory option 1 would be the 'ideal' situation, as at any given time you are running the maximum number of instances, but the dynamic nature of it is likely to be a performance issue going forward, with all the extra work needed to rework the threads in flight.
If you've got the CPU time to spare then grouping is potentially a win; as long as you don't end up with subroutines with wildly different GPR counts in the same shader, it could help matters.
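One way you might do the grouping, sketched in Python: sort the subroutines by GPR count and start a new shader whenever the next one needs more than some multiple of the smallest count in the current group. The subroutine names, their GPR counts and the 2x threshold are all invented for illustration; you'd take the real counts from your shader compiler's output.

```python
def group_by_gpr(gpr_needs, max_ratio=2.0):
    """Greedily bucket subroutines so no group mixes wildly different GPR counts.

    gpr_needs maps subroutine name -> GPR count. A subroutine joins the current
    group only if its count is at most max_ratio times the group's smallest count.
    """
    groups = []
    for name, gprs in sorted(gpr_needs.items(), key=lambda item: item[1]):
        if groups and gprs <= groups[-1][0][1] * max_ratio:
            groups[-1].append((name, gprs))
        else:
            groups.append([(name, gprs)])
    # Return just the names; each inner list becomes one shader.
    return [[name for name, _ in group] for group in groups]


# Hypothetical subroutines and GPR counts.
subroutines = {"fog": 4, "lambert": 6, "phong": 8, "parallax": 18, "subsurface": 22}
shaders = group_by_gpr(subroutines)
print(shaders)
```

Each group then pays only for its own worst case, at the price of more shader switches on the CPU side.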