Overhead of subroutines in arrays when multidrawing?


Most of my drawing is with glMultiDrawElementsIndirectCountARB(), but I'm still splitting up the draw calls between groups of objects with different material types, in order to bind different shader programs. I've been considering just indexing by object ID into an array of subroutines so I don't have to switch shaders and thus have only one draw call per pass, but I'm wondering if an array of subroutines would have significant overhead, as it would be combining the additional dereference (the array) with a function call that can't be inlined (subroutine). Is this a realistic concern, or is it likely to be negligible?
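Roughly what I have in mind is something like this (just a sketch with made-up names; the index would of course have to be dynamically uniform, e.g. derived from gl_DrawIDARB in the vertex shader):

#version 450

subroutine vec4 ShadeFunc(vec4 baseColour);

subroutine(ShadeFunc) vec4 shadeOpaque(vec4 c)   { return c; }
subroutine(ShadeFunc) vec4 shadeEmissive(vec4 c) { return c * 2.0; }

// one entry per material type; the whole array is set up front rather than per draw
subroutine uniform ShadeFunc materialShade[16];

flat in int vMaterialID;   // written from gl_DrawIDARB (or an object ID) in the vertex shader
in vec4 vColour;
out vec4 fragColour;

void main()
{
    // the extra indirection plus a call that can't be inlined -- this is the bit I'm asking about
    fragColour = materialShade[vMaterialID](vColour);
}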

The other issue I'm worried about is that different shaders in general require different inputs/outputs, and I'm not sure how much performance would be wasted by interpolation of input/output pairs that are unused by given subroutines.

"But who prays for Satan? Who, in eighteen centuries, has had the common humanity to pray for the one sinner that needed it most?" --Mark Twain

~~~~~~~~~~~~~~~Looking for a high-performance, easy to use, and lightweight math library? http://www.cmldev.net/ (note: I'm not associated with that project; just a user)
This is a classic CPU vs GPU trade-off.
If your frame-rate is limited by your CPU (i.e. CPU time per frame > GPU time per frame), and you're desperate to claw back some milliseconds on the CPU side, then this will let you greatly reduce calls into GL and the work done by the driver's background threads.
If you're limited by the GPU though, then this will most likely be a huge anti-optimization.

The GPU cost will depend on the generation/model. I guess only fairly modern generations will support this method of drawing in the first place though ;)

Assuming that modern cards can branch cheaply, a bigger performance concern will be register pressure.
When your program is compiled, it ends up requiring some number of registers to hold the state of a running program (inputs, temp variables, etc). The GPU will have a fixed number of these registers available -- say the GPU has 1000 and your shader uses 10: this means the GPU can be running 100 executions (pixels, vertices, etc) of your shader at the same time. This means that when one stalls (waiting for memory, etc), it can switch to one of the other 99 instances and do some work there.
Fewer registers used by your shader == more 'hyperthreading' == memory accesses seem faster.

With an übershader like this, the register requirement for the shader will be the worst case of all the individual sub-programs that make it up. If most of the sub-programs only require 10 registers, but one requires 100 registers, then too bad, no matter which sub-program is being used, the GPU can only ever do 10x 'hyperthreading' instead of 100x... Which can be a huge performance cost on modern cards.

As for the unused interpolated variables - imagine if the inputs to the pixel shader were the pixel's barycentric coordinates plus the output values of all 3 vertices, and you had to implement the interpolation yourself.
Newer cards are likely to do it this way (the driver generating pixel-shader interpolation code) rather than using fixed-function interpolation hardware.
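Conceptually something like this (a hypothetical sketch of what such generated code might look like, with perspective correction left out):

// hypothetical code the driver might generate for each pixel-shader input
vec4 interpolateInput(vec3 bary, vec4 v0, vec4 v1, vec4 v2)
{
    // bary = the pixel's barycentric coordinates, v0..v2 = the three vertices' output values
    return bary.x * v0 + bary.y * v1 + bary.z * v2;
}

Every unused input still costs you one of these per pixel, plus the registers to hold it.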

Very informative, thanks.

I didn't know that registers are not allocated dynamically per subroutine. Is this likely to stay this way for a long time?

What do you think, then, about grouping subroutines into several shaders based on complexity? Couldn't that be a good compromise?

You mention variable interpolation is likely done in software on newer cards. Do we expect that texture filtering will eventually be done that way too? A common optimization in convolution shaders so far has been relying on bilinear filtering to reduce the number of samples, but if this ends up being shader code anyway, then there'd be no point bothering with the added complexity (in terms of calculating the sample locations and weights).
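By that trick I mean something along these lines (a rough sketch with made-up names):

uniform sampler2D src;
uniform vec2 texelSize;

// merge two 1D blur taps (offsets o0/o1, weights w0/w1) into a single fetch placed
// between the texels, letting the hardware's bilinear filter do the blend
vec4 twoTapsInOne(vec2 uv, float o0, float o1, float w0, float w1)
{
    float w = w0 + w1;
    float o = (o0 * w0 + o1 * w1) / w;
    return texture(src, uv + vec2(o, 0.0) * texelSize) * w;
}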

"But who prays for Satan? Who, in eighteen centuries, has had the common humanity to pray for the one sinner that needed it most?" --Mark Twain

~~~~~~~~~~~~~~~Looking for a high-performance, easy to use, and lightweight math library? http://www.cmldev.net/ (note: I'm not associated with that project; just a user)

Very informative, thanks.
I didn't know that registers are not allocated dynamically per subroutine. Is this likely to stay this way for a long time?
What do you think, then, about grouping subroutines into several shaders based on complexity? Couldn't that be a good compromise?


I don't see it changing any time soon; the registers are divided up between instances when a program starts so the hardware can know how many it can keep in flight at once. If you make that dynamic in some way you increase the cost of execution: we can't grab all the resources up front, so during program flow we have to try to grab more and then potentially fail if the resources aren't there.

Let's say the shader byte stream, and thus the hardware, knew of sub-routine points; then it might work like this.
If you had a shader made up of a main and 2 sub-routines:
main - 5 GPR
func1 - 10 GPR
func2 - 20 GPR

So when execution starts the hardware gets told we need 5 GPR, so it figures out how many instances it can launch and off we go.
Then you pick a sub-routine but hold on; where do we get our GPRs from? We allocated them all up front to spawn our initial workload... bugger. We deadlock and the GPU locks up.
At this point you've got two choices:

1) Attempt to stream the state out and reallocate - this would be slow as you'd need to stream to main memory, it would halt any processing while this is going on, and then you'd take the state setup cost again while you reconfigure to have half as many tasks running at once (for func1), or even a quarter as many (func2). Then when the subroutines return you have to restore main's state, set up again, pull all the tasks back together and relaunch in the old configuration.
(I've probably missed some problems, as wavefronts/warps could be on the same SIMD unit, thus sharing the register file, but executing different paths, so you run a performance risk again when something with a higher GPR count needs space but not enough register space is free; whole wavefronts/warps end up sleeping at this point, which could hurt performance.)

2) The current system of pre-allocating registers, where you run a lower number of instances at once but don't need any complicated hardware logic for rescheduling workloads as the shaders progress.

In theory 1 would be the 'ideal' situation, as at any given time you are running the maximum number of instances, but the dynamic nature of it is likely to be a performance issue going forward given all the extra work needed to rework the threads in flight.

If you've got the CPU time to spare then grouping is potentially a win; as long as you don't end up with subroutines with wildly different GPR counts it could help matters.

The thing is, since we're not talking about indexing using an arbitrary value, such as an ID read from a texture, the GLSL runtime _does_ have the information that it needs to optimally allocate registers _per draw_. Indexes dependent solely on gl_DrawIDARB (and gl_InstanceID) are, by the spec, dynamically uniform expressions. Surely there is already some sort of runtime partial specialization of shaders--else why define the concept of dynamically uniform expressions (which are constant within a work group) at all? So why can't register allocation that depends on subroutine selection--selection driven not just by constant expressions but by dynamically uniform expressions as well--be part of that specialization?
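In other words, something like this sketch (made-up names), where nothing but gl_DrawIDARB selects the subroutine:

#version 450
#extension GL_ARB_shader_draw_parameters : require

subroutine vec4 TransformFunc(vec4 p);
subroutine(TransformFunc) vec4 transformStatic (vec4 p) { return p; }
subroutine(TransformFunc) vec4 transformSkinned(vec4 p) { /* skinning would go here */ return p; }
subroutine uniform TransformFunc transforms[2];

in vec4 position;
uniform mat4 viewProj;

void main()
{
    // gl_DrawIDARB is dynamically uniform: it's constant for every vertex of a given sub-draw
    gl_Position = viewProj * transforms[gl_DrawIDARB & 1](position);
}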

"But who prays for Satan? Who, in eighteen centuries, has had the common humanity to pray for the one sinner that needed it most?" --Mark Twain

~~~~~~~~~~~~~~~Looking for a high-performance, easy to use, and lightweight math library? http://www.cmldev.net/ (note: I'm not associated with that project; just a user)
That depends on the source of the data.

If your subroutine switch was provided by a glUniform call pre-draw then yes, the driver can see the value and will likely recompile a shadow copy of your GLSL program which removes the jump and inlines the code. You've now got the best of both worlds (post-recompile) as your register allocation is now constant, the jump is gone, and as the user you've not had to write N versions of the code and don't see or care about this driver-based magic going on behind the scenes.
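For the simple case I mean something like this (sketch, made-up names):

subroutine vec3 LightFunc(vec3 n);
subroutine(LightFunc) vec3 lightLambert(vec3 n) { return vec3(max(n.z, 0.0)); }
subroutine(LightFunc) vec3 lightUnlit  (vec3 n) { return vec3(1.0); }

// a single selector set from the CPU pre-draw (glUniformSubroutinesuiv); because the
// value is visible to the driver at draw time it can specialise a shadow copy of the
// program around whichever function was picked and inline the call
subroutine uniform LightFunc lighting;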

The problem with gl_DrawID and gl_InstanceID is right there in what you wrote however; "dynamically uniform expressions".
Neither of these values is visible to the driver pre-command buffer build so it has no way of knowing what they are to do the shadow recompiles.

That's certainly true where indexing on gl_InstanceID is used, because it doesn't vary per command kick/GPU state but during the execution of a batch of work. The driver would have to evaluate where gl_InstanceID is used, look at the draw call to figure out the extent, then shadow-compile N copies of the shader and generate a command buffer doing N kicks for the draw with a different program for each instance (or group of instances, depending on how it is used).

Now, gl_DrawID might be a bit more feasible.
If you are using the non-indirect count version, so the count comes from a CPU-side parameter, then the driver could potentially shadow-compile up to 'count' versions of the shader (depending on usage; the driver would still need to look into the shader to see how it is used) and then issue multiple kick commands with a different shader for each draw.

Once you get into the indirect versions however life gets much harder; while the CPU can see a 'max count' it can't know how many draws the command processor on the GPU will end up setting up. So unless the command processor has a set of instructions which allow it to index into an array of program binaries (which would have to be 'max count' in size and consist of pointers to the programs) it has no way to route this information, so any choices would be 'late'.

So, in some cases it might be possible to do this, HOWEVER it will come at a much greater CPU cost as you have to perform much more complex work up front in the driver for the general case of generating the command buffer for the GPU to consume. In the case of instancing it would basically undermine instancing; in the non-indirect multi-draw case it might help, as I believe these are generally handled as N kick commands under the hood anyway, but for anything sourcing data off a GPU buffer it could be impossible.

But it comes at the cost of increased memory usage & more driver complexity, as the driver has to evaluate the shader and make choices, which increases CPU usage before we even get to the more complicated dispatch logic.

Depending on how it is implemented it could also cause GPU performance issues, as instead of a large block of many wavefronts/warps moving through the system you could now have smaller ones doing less work but with worse occupancy and/or less chance for latency hiding, depending on how the work is spread across the GPU's compute units.

Now, that's not to say the drivers don't already do some of this; the trivial glUniform case is probably handled, and you might even get the multi-draw case too, although it seems less likely. However I wouldn't count on it.

Great theoretical discussion, but as no one has provided any experimental data, I'll chime in.

I tested this out in hopes of getting better performance than switching between around 80 different shaders, which are essentially just one übershader compiled with various macros. Some of the macros include VS skinning, diffuse texture sampling, alpha testing, lighting calculations (if the object has a normal) and of course interpolating more stuff just in case it has normals or texcoords (though most of my objects did).

So my shaders had, for example, a VS skinning subroutine uniform for which I provided 2 versions: one in which skinning takes place, multiplying 4 matrices, and one in which nothing happens. For texturing, the same: one subroutine would sample a texture, the other would do nothing.
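Roughly like this (a simplified sketch of the idea, not my exact code; names and layouts are made up):

in vec4 weights;
in ivec4 indices;
uniform mat4 bones[64];

subroutine vec4 SkinFunc(vec4 pos);

subroutine(SkinFunc) vec4 skinNone(vec4 pos)
{
    return pos;                                   // static geometry: pass straight through
}

subroutine(SkinFunc) vec4 skinBones(vec4 pos)
{
    mat4 blended = weights.x * bones[indices.x]
                 + weights.y * bones[indices.y]
                 + weights.z * bones[indices.z]
                 + weights.w * bones[indices.w];  // the 4-matrix version
    return blended * pos;
}

subroutine uniform SkinFunc skin;                 // selected per object before each draw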

My results on an AMD HD7850 were that I had 5% less performance with a single program compared to ~80 different variants, with the same number of draw calls, multiple glUniformSubroutine calls per object, and ~100 objects. I don't know if the perf degradation came from the extra glUniformSubroutine calls, as these are rather odd: once you use them for a draw call, you have to call the function again before the next use because the state is forgotten (they say it's a D3D11 artifact).

However, this was just one way to use them, as micro-jumps don't seem faster than switching between multiple shaders. But I'm thinking of using them in a different way in the future: make one subroutine function variant be an entire VS/PS/etc., then index that based on DrawID or similar and unify draw calls with different shaders. Adding bindless textures to the mix and merging VB/IB/CB together would make it possible to have just a single draw call per scene :).

Relative Games - My apps

The problem with saying '5% less performance' is we don't know if you were CPU or GPU bound in your tests.

If you were CPU bound then it's possible the extra calls made the difference.
If you were GPU bound then it might be a GPU resource allocation problem.
Or it could be anything in between :)

If you could run the tests via AMD's profiling tools and get some solid numbers that would be useful.
Even their off-line shader analyser might give some decent guideline numbers as that'll tell you VGPR/SGPR usage and you can dig from there to see what is going on.

I'm GPU bound as I barely do anything on the CPU and I usually have near 99% GPU usage. I had 95%+ GPU usage with my old HD7850 when I did the test, but now with my 280X I have around 70% GPU usage, same scene, same code, while the DX11 code is still at 99%+. Thanks for the heads up on the tools, I'll check them out whenever I can experiment with that code :).

Relative Games - My apps

That's good to know and does tend to fall roughly in line with what I've heard floating around GPU-wise.

Out of interest what kind of performance are you getting with the DX11 path? (percentage faster/slower would do)

