Prune

Overhead of subroutines in arrays when multidrawing?

9 posts in this topic

Most of my drawing is with glMultiDrawElementsIndirectCountARB(), but I'm still splitting up the draw calls between groups of objects with different material types, in order to bind different shader programs. I've been considering just indexing by object ID into an array of subroutines so I don't have to switch shaders and thus have only one draw call per pass, but I'm wondering if an array of subroutines would have significant overhead, as it would be combining the additional dereference (the array) with a function call that can't be inlined (subroutine). Is this a realistic concern, or is it likely to be negligible?
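To make the question concrete, here is a minimal fragment-shader sketch of the approach being asked about: an array of subroutine uniforms indexed by a per-draw value. All names (`MaterialFn`, `u_Materials`, `v_MaterialId`) are illustrative, not from any real codebase, and the index is assumed to be forwarded from the vertex shader, derived from gl_DrawIDARB so it stays dynamically uniform.

```glsl
#version 450
// Sketch only: one subroutine type covering all material shading paths.
subroutine vec4 MaterialFn(vec3 normal, vec2 uv);

subroutine(MaterialFn) vec4 shadeOpaque(vec3 n, vec2 uv)   { return vec4(max(n.z, 0.0)); }
subroutine(MaterialFn) vec4 shadeEmissive(vec3 n, vec2 uv) { return vec4(2.0, 2.0, 2.0, 1.0); }

// One slot per material type; the element is chosen by a dynamically
// uniform index, so all invocations in a given draw take the same path.
subroutine uniform MaterialFn u_Materials[8];

flat in int v_MaterialId;   // set in the VS from gl_DrawIDARB (illustrative)
in vec3 v_Normal;
in vec2 v_UV;
out vec4 fragColor;

void main() {
    fragColor = u_Materials[v_MaterialId](v_Normal, v_UV);
}
```

The question is whether the indirection through `u_Materials[v_MaterialId]` plus the non-inlinable call costs anything measurable compared with binding a separate program per material group.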

The other issue I'm worried about is that different shaders in general require different inputs/outputs, and I'm not sure how much performance would be wasted by interpolation of input/output pairs that are unused by given subroutines.

Edited by Prune

Very informative, thanks.

I didn't know that registers are not allocated dynamically per subroutine. Is this likely to stay this way for a long time?

What do you think, then, about grouping subroutines into several shaders based on complexity? Couldn't that be a good compromise?

 

You mention variable interpolation is likely done in software on newer cards. Do we expect that eventually texture filtering will also be done that way? A common optimization in convolution shaders thus far has been relying on bilinear interpolation to reduce the number of samples, but if this ends up being shader code anyway, then there'd be no point in bothering with the added complexity (in terms of calculating the sample locations and weights).
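For reference, this is the bilinear trick in question: two adjacent texel weights (w1, w2) are folded into a single linear fetch at offset (o1·w1 + o2·w2)/(w1 + w2), so a 9-tap separable Gaussian needs only 5 fetches. A sketch, using one commonly published set of folded weights; names like `u_TexelStep` are illustrative:

```glsl
#version 330 core
uniform sampler2D u_Src;
uniform vec2 u_TexelStep;   // (1/width, 0) for the horizontal pass, (0, 1/height) for vertical

in vec2 v_UV;
out vec4 fragColor;

// Center weight plus two folded side taps; each side tap lands between two
// texels so the bilinear filter blends them with the correct Gaussian ratio.
const float weight[3] = float[](0.2270270270, 0.3162162162, 0.0702702703);
const float offset[3] = float[](0.0, 1.3846153846, 3.2307692308);

void main() {
    vec4 acc = texture(u_Src, v_UV) * weight[0];
    for (int i = 1; i < 3; ++i) {
        acc += texture(u_Src, v_UV + u_TexelStep * offset[i]) * weight[i];
        acc += texture(u_Src, v_UV - u_TexelStep * offset[i]) * weight[i];
    }
    fragColor = acc;
}
```

If filtering ever moves into shader code, the saving from halving the fetch count would indeed evaporate, which is the point being raised.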


The thing is, since we're not talking about indexing using an arbitrary value, such as an ID read from a texture, the GLSL runtime _does_ have the information that it needs to optimally allocate registers _per draw_. Indexes dependent solely on gl_DrawIDARB (and gl_InstanceID) are, by the spec, dynamically uniform expressions. Surely there is already some sort of runtime partial specialization of shaders; else why have the defined concept of dynamically uniform expressions (which are constant within a work group) at all? So why can't register allocations that depend on a subroutine selection that is itself a dynamically uniform expression, not just a constant one, be part of that specialization?

Edited by Prune
That depends on the source of the data.

If your subroutine switch was provided by a glUniform call pre-draw then yes, the driver can see the value and will likely recompile a shadow copy of your GLSL program which removes the jump and inlines the code. You've now got the best of both worlds (post-recompile) as your register allocation is now constant, the jump is gone and as the user you've not had to write N-versions of the code and don't see or care about this driver based magic going on behind the scenes.
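A minimal host-side sketch of that case, assuming a program whose vertex shader exposes two subroutine variants; function and variable names here are illustrative, not from the thread:

```c
/* The selector is set with glUniformSubroutinesuiv before the draw, so the
 * driver can see the value on the CPU timeline and may specialize the
 * program behind the scenes as described above. */
GLuint skinnedIdx = glGetSubroutineIndex(prog, GL_VERTEX_SHADER, "skinnedTransform");
GLuint rigidIdx   = glGetSubroutineIndex(prog, GL_VERTEX_SHADER, "rigidTransform");

/* The indices array must cover every active subroutine uniform location;
 * here we assume exactly one. */
GLuint selection[1];
selection[0] = useSkinning ? skinnedIdx : rigidIdx;
glUniformSubroutinesuiv(GL_VERTEX_SHADER, 1, selection);

glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);
```

Note that subroutine uniform state is not retained across glUseProgram, so the selection has to be re-issued each time the program is bound (a quirk that comes up again further down the thread).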

The problem with gl_DrawID and gl_InstanceID is right there in what you wrote, however: "dynamically uniform expressions".
Neither of these values is visible to the driver before the command buffer is built, so it has no way of knowing what they are in order to do the shadow recompiles.

That's certainly true in the case where indexing on gl_InstanceID is used, because the index doesn't vary per command kick/GPU state but during the execution of a batch of work. The driver would have to evaluate where gl_InstanceID is used, look at the draw call to figure out the extent, then shadow-compile N copies of the shader and generate a command buffer doing N kicks for the draw with a different program for each instance (or group of instances, depending on how it is used).

Now, gl_DrawID might be a bit more feasible.
If you are using the non-indirect count version, so count comes from a CPU side parameter, then the driver could potentially shadow compile up to 'count' versions of the shader (depending on usage; driver would still need to look into the shader to see how it is used) and then issue multiple kick commands with a different shader for each draw.

Once you get into the indirect versions, however, life would get much harder; while the CPU can see a 'max count' it can't know how many draws the command processor on the GPU will end up setting up. So unless the command processor has a set of instructions which allow it to index into an array of program binaries to use (which would have to be 'max count' in size and consist of pointers to the programs) it has no way to route this information, so any choices would be 'late'.

So, in some cases it might be possible to do this, HOWEVER it will come at a much greater CPU cost as you have to perform much more complex work up front in the driver for the general case of generating the command buffer for the GPU to consume. In the case of instancing it would basically undermine instancing; in the non-indirect multi-draw case it might help, as I believe these are generally handled as N kick commands under the hood anyway, but for anything sourcing data off a GPU buffer it could be impossible.

But, it comes at the cost of increased memory usage & more driver complexity as it has to evaluate the shader and make choices which increases CPU usage before we even get to the more complicated dispatch logic.

Depending on how it is implemented it could also cause GPU performance issues, as instead of a large block of many wavefronts/warps moving through the system you could now have smaller ones doing less work but with worse occupancy and/or less chance for latency hiding, depending on how the work is spread across the GPU's compute units.

Now, that's not to say the drivers don't already do some of this; the trivial glUniform case is probably handled, and you might even get the multi-draw case too, although it seems less likely. However, I wouldn't count on it.

Great theoretical talk, but since no one has provided any experimental data, I'll chime in.

 

I tested this out in hopes of getting better performance than switching between around 80 different shaders, which are essentially just one uber shader compiled with various macros. Some of the macros include VS skinning, diffuse texture sampling, alpha testing, lighting calculations (if it has a normal), and of course interpolating more stuff just in case it has normals or texcoords (though most of my objects did).

 

So my shaders had, for example, a VS skinning subroutine uniform for which I provided two versions: one in which skinning takes place, blending four matrices, and one in which nothing happens. Same for texturing: one subroutine would sample a texture, the other would do nothing.
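A vertex-shader sketch of what such a subroutine pair could look like; all names, the bone count, and the attribute layout are illustrative assumptions, not the poster's actual code:

```glsl
#version 430
subroutine vec4 TransformFn(vec4 pos);

layout(std140) uniform Bones { mat4 u_Bones[64]; };
in ivec4 a_BoneIds;
in vec4  a_BoneWeights;

// Variant 1: blend four bone matrices and skin the position.
subroutine(TransformFn) vec4 skinned(vec4 pos) {
    mat4 skin = u_Bones[a_BoneIds.x] * a_BoneWeights.x
              + u_Bones[a_BoneIds.y] * a_BoneWeights.y
              + u_Bones[a_BoneIds.z] * a_BoneWeights.z
              + u_Bones[a_BoneIds.w] * a_BoneWeights.w;
    return skin * pos;
}

// Variant 2: the do-nothing path for rigid geometry.
subroutine(TransformFn) vec4 rigid(vec4 pos) { return pos; }

subroutine uniform TransformFn u_Transform;

in vec4 a_Position;
uniform mat4 u_MVP;

void main() { gl_Position = u_MVP * u_Transform(a_Position); }
```

The point of the measurement below is whether selecting between `skinned` and `rigid` this way beats binding a different pre-specialized program per object group.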

 

My results on an AMD HD7850 were that I got 5% less performance with a single program compared to ~80 different variants, same number of draw calls, multiple glUniformSubroutine calls per object, and ~100 objects. I don't know if the perf degradation came from the extra glUniformSubroutine calls, which are rather odd: once you use them for a draw call, you have to call the function again before the next draw because the selection state is forgotten (they say it's a D3D11 artifact).

 

However, this was just one way to use them, as micro jumps don't seem faster than multiple shader switches. But I'm thinking of using them in a different way in the future: make each subroutine function variant an entire VS/PS/etc., then index that based on DrawID or similar and unify draw calls across different shaders. Adding bindless textures to the mix and merging VB/IB/CB together would make it possible to have just a single draw call per scene :).

The problem with saying '5% less performance' is we don't know if you were CPU or GPU bound in your tests.

If you were CPU bound then it's possible the extra calls made the difference.
If you were GPU bound then it might be a GPU resource allocation problem.
Or it could be anything in between :)

If you could run the tests via AMD's profiling tools and get some solid numbers that would be useful.
Even their off-line shader analyser might give some decent guideline numbers as that'll tell you VGPR/SGPR usage and you can dig from there to see what is going on.
2

Share this post


Link to post
Share on other sites


I'm GPU bound, as I barely do anything on the CPU and I usually have near 99% GPU usage. I was getting 95%+ GPU usage with my old HD7850 when I did the test, but now with my 280X I have around 70% GPU usage, same scene, same code, while the DX11 path is still at 99%+. Thanks for the heads-up about the tools, I'll check them out whenever I can experiment with that code :).

That's good to know and does tend to fall roughly in line with what I've heard floating around, GPU-wise.

Out of interest what kind of performance are you getting with the DX11 path? (percentage faster/slower would do)
