ill

Why use Uber Shaders? Why not one giant shader with some ifs?


I'm not sure how it works in the world of shaders, but in CUDA, when you have branches and all threads take the same branch, there is (as I understand it) no slowdown. The slowdown comes from divergence: some of the processors have to sit idle while one group of threads takes one branch and the rest take the other.

So say I was doing a renderer in CUDA.

I could say, enable light 0 and make it a point light.
Disable light 1.
Disable light 2.

Then I'd have the entire program loaded and just run something like this:
[CODE]
for (int light = 0; light < maxLights; light++) {
    if (gl_light[light].enabled) {
        switch (gl_light[light].type) {
        case LIGHT_SPOT:
            // do spot light stuff
            break;
        case LIGHT_POINT:
            // do point light stuff
            break;
        case LIGHT_DIRECTIONAL:
            // do directional light stuff
            break;
        }
    }
}
[/CODE]

There are branches, but all processors take the same path through them, so there's no divergence.

In the world of shaders I've seen people recommend things like uber shaders to handle different permutations of render states and avoid branches, since branches are supposed to be slow. But is it really slower in shaders when all threads take the same path through the code? With permutations you end up compiling many different shaders and switching which one is loaded based on what you are rendering, which itself causes some slowdown.

Is there a reason not to just have these much bigger shaders with some if statements?
[quote name='ill' timestamp='1335426095' post='4934988']all threads take the same branch, as I understand there is no slow down[/quote]No, when all threads take the same branch, there is no [u]extra[/u] slowdown on top of the regular number of cycles it takes to process the condition and branch instructions -- every instruction still has a cost ([i]unless it's optimised out -- N.B. some graphics drivers can optimise out these kind of non-divergent branches when you issue your draw-call, if it's able to determine that all threads will take the same path[/i]).
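For illustration, a minimal HLSL sketch (identifiers invented for this example) of the kind of non-divergent branch being discussed: the condition comes from a constant buffer, so every thread in the draw call takes the same path, yet the compare and jump still cost instructions unless the driver strips them out.

[CODE]
cbuffer LightParams : register(b0)
{
    int g_lightType;   // set once per draw call -- identical for every thread
};

float3 ShadeLight(float3 n, float3 l, float3 albedo)
{
    // Every pixel reads the same g_lightType, so the warp/wavefront never
    // diverges here -- but evaluating the condition is still not free.
    if (g_lightType == 0)
        return albedo * saturate(dot(n, l));          // directional
    else
        return albedo * saturate(dot(n, l)) * 0.5f;   // placeholder point-light term
}
[/CODE]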
[quote name='ill' timestamp='1335426095' post='4934988']
But is it really slower in shaders when all of them take the same path in the code?
[/quote]
At least the calculation and testing of the conditions is unnecessary work in this case.


PS: Ahh... too slow. Hodgman is lurking in the shadows all day :ph34r: Edited by Ashaman73
You also have to watch out for increased register pressure from having too many branches, since the compiler will need to allocate enough registers to handle both sides of the branch.

Anyway, you have to realize that a lot of advice regarding graphics comes from the era before DX10/CUDA-capable GPUs. This is because old information hangs around the internet instead of dying out, and also because that level of hardware is still prevalent in consoles, mobile devices, and PCs. Before DX10 hardware, branching was generally a much less appealing proposition.
On older hardware such as the Xbox 360 you can go so far as to explain to the compiler exactly what you are trying to do with the branching.
It's far safer to go the permutation route with older hardware.
Also why would you want to do that anyway? Seems like an antipattern to me.
[quote name='MJP' timestamp='1335427587' post='4934994']
You also have to watch out for increased register pressure from having too many branches, since the compiler will need to allocate enough registers to handle both sides of the branch.
[/quote]

Just highlighting this. If you mix shaders that do a lot of complex lighting math with shaders that are relatively simple, the register requirements of the complex case will kill your warp occupancy (and hence performance) in the simple case.
There is the increased instruction count, but I was thinking maybe it would be balanced out by not having to constantly switch the loaded shader as you are drawing different materials.

I can see the register count being a problem... But if you have a complex shader using many registers, and a branch that is simpler and uses fewer, is that really made worse? There are times when you would have the more complex shader loaded and be using a lot of registers anyway.

When you take a branch, I'm pretty sure it uses the same set of registers. I'm not sure how it is on the GPU, but on the CPU branch A won't have registers 1-5 reserved and branch B registers 6-10 reserved.

With good optimizations it should work something more like: branch A says it needs 3 registers, branch B says it needs 10. So if you take branch A you use registers 1-3, and if you take branch B you use registers 1-10.
So the provided example may not issue branch instructions [i]at all[/i]. This is a candidate for uniform branching, in which case the runtime or driver may choose to produce multiple compilations of the shader in which all of the branches have been resolved and the loops unrolled. This was really common before we had hardware branching, in the shader model 2.x days. I'm not sure to what extent it's still used now, but you can hint the compiler to unroll loops and avoid branches.

Uber shaders give you much more precise control over compilation though.
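For reference, those hints in HLSL look something like this ([unroll], [loop], [branch] and [flatten] are the actual HLSL attributes; the shader around them is a made-up sketch):

[CODE]
Texture2D    g_texture : register(t0);
SamplerState g_sampler : register(s0);

cbuffer Params : register(b0)
{
    float2 g_offsets[4];
    bool   g_useShadows;
};

float4 PS(float2 uv : TEXCOORD0) : SV_Target
{
    float4 color = 0;

    [unroll]   // fully unroll: no loop or branch instructions are emitted
    for (int i = 0; i < 4; ++i)
        color += g_texture.Sample(g_sampler, uv + g_offsets[i]);

    [branch]   // request a real jump instead of executing both sides
    if (g_useShadows)
        color.rgb *= 0.5f;   // placeholder for a shadow term

    return color;
}
[/CODE]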
[quote name='Promit' timestamp='1335458135' post='4935120']This is a candidate for uniform branching, in which case the runtime or driver may choose to produce multiple compilations of the shader where all of the branches have been resolved and the loops unrolled.[/quote]

That sounds really nice. So it basically knows which compiled version to use depending on what arguments I send it?

I would say, glUseProgram(someShader), and based on the parameters I set, it would actually select the real shader I want? In many cases, the branches are super obvious, like this material is not using normal maps, or this light is not casting specular reflections...

Right now I have a system that uses bit masks to figure out the permutations of the uber shader to load and it's cool and all, but if I can have something much simpler, that would be awesome.
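For comparison, a bit-mask permutation system like that usually bottoms out in preprocessor switches, compiled once per mask value. A sketch (the defines are hypothetical; /D is the standard fxc way to set them, and with GLSL you'd prepend the #defines to the source string before glCompileShader):

[CODE]
// uber.hlsl -- compiled once per permutation by toggling defines, e.g.:
//   fxc /T ps_5_0 /E PS /D USE_NORMAL_MAP=1 /D USE_SPECULAR=0 uber.hlsl
// Each bit in the CPU-side mask maps to one of these defines.

Texture2D    g_normalMap : register(t0);
SamplerState g_samp      : register(s0);

float4 PS(float3 normal : NORMAL, float2 uv : TEXCOORD0) : SV_Target
{
#if USE_NORMAL_MAP
    // hypothetical: decode a tangent-space normal (TBN transform omitted)
    float3 n = normalize(g_normalMap.Sample(g_samp, uv).xyz * 2 - 1);
#else
    float3 n = normalize(normal);
#endif

    float3 lit = saturate(dot(n, float3(0, 1, 0))).xxx;

#if USE_SPECULAR
    lit += 0.2f;   // placeholder specular term
#endif

    return float4(lit, 1);
}
[/CODE]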



And yeah I can see the problem with using too many registers now. It's always good to know how something actually works.
Let me add one further note on the register allocation problem. If we can decide on one branch before we issue the draw call, Dx has some sweet candy for us. Formerly we would have branched depending on some constant buffer value and probably uniform branching would have kicked in. But now, Dx11 brought us interfaces to HLSL. With those we can define methods, which can be implemented by multiple classes. Before issuing a draw call we can assign a particular class that should be used for an interface variable. The good news is that the driver inlines the hardware native shader code of the methods - declared in the interface and implemented by the selected class - at bind time (!), thereby choosing the optimal register count.

This is supposed to be the solution to the dilemma: ubershaders vs. many specialized shader files. It has two upsides: We can stop worrying about the register allocation (since we’re not branching) and the code becomes cleaner (neither huge branch trees nor dozens of shader files for the permutations).
Of course on the downside it can only optimize the function bodies independently. :-/ But still, it's a very helpful tool.

Allison Klein (Gamefest 2008, slides and audio track are online on MSDN) and Nicolas Thibieroz (GDC 09) talked a little about this.
(Edit: In OpenGL the concept is called Subroutine Functions and it does basically the same thing.)
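A minimal sketch of what those HLSL interfaces look like (identifiers are made up; the application picks the concrete class per draw call through ID3D11ClassLinkage when binding the shader):

[CODE]
interface ILight
{
    float3 Shade(float3 normal, float3 toLight);
};

class DirectionalLight : ILight
{
    float3 color;
    float3 Shade(float3 normal, float3 toLight)
    {
        return color * saturate(dot(normal, normalize(toLight)));
    }
};

class PointLight : ILight
{
    float3 color;
    float  invRadius;
    float3 Shade(float3 normal, float3 toLight)
    {
        float atten = saturate(1.0f - length(toLight) * invRadius);
        return color * saturate(dot(normal, normalize(toLight))) * atten;
    }
};

// Class data lives in a constant buffer; instances are looked up by
// name from the application (ID3D11ClassLinkage::GetClassInstance).
cbuffer LightData : register(b0)
{
    DirectionalLight g_directionalLight;
    PointLight       g_pointLight;
};

// Abstract slot: whichever instance is bound at draw time gets inlined,
// so registers are allocated for exactly that implementation.
ILight g_light;

float4 PS(float3 normal : NORMAL, float3 toLight : TEXCOORD0) : SV_Target
{
    return float4(g_light.Shade(normalize(normal), toLight), 1.0f);
}
[/CODE]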
[quote name='Tsus' timestamp='1335469564' post='4935191']Dx11 brought us interfaces to HLSL. [...] The good news is that the driver inlines the hardware native shader code of the methods - declared in the interface and implemented by the selected class - at bind time (!), thereby choosing the optimal register count.[/quote]

This is also available in nVidia's Cg library and works on a much wider array of hardware-- it was even working on the old GeForce 6800s way back when GPU Gems (1!) was the hot new thing.

Just FYI :)
[quote name='InvalidPointer' timestamp='1335472351' post='4935209']This is also available in nVidia's Cg library and works on a much wider array of hardware[/quote]

Nice, thanks a lot! That’s very good to know! :)
GPU Gems 1 is indeed quite antique. :) Kind of cool that those things were possible for so long.

How does Cg handle this? Is it compiling and optimizing the function bodies individually, too, and inlining them at bind time? Or does it compile all permutations completely? How can I – as a programmer – decide which permutation to pick for execution?

Can you tell me what the whole thing is called in Cg terminology, so I can find it more easily?
I was curious and started browsing through the [url="ftp://download.nvidia.com/developer/cg/Cg_Specification.pdf"]Cg specification[/url] to find out more. Do you mean “Overloading of functions by profile” (page 170)? Also a nice feature, but that’s not it, is it? This doesn’t seem to solve the permutation issue - or does it?

Thanks!
[url="http://http.developer.nvidia.com/GPUGems/gpugems_ch32.html"]It's right smack in GPU gems, actually[/url]. The article doesn't go too much into implementation details, but I wager there's some sort of runtime inlining or additional precompilation going on.

EDIT: If they don't mention it in the new language manuals, I may stand corrected here. Wonder if it's been removed/deprecated somehow.

EDIT 2: Based on the API descriptions provided, it's probably the first approach, inlining/AST substitution.
[quote name='InvalidPointer' timestamp='1335484792' post='4935257'][url="http://http.developer.nvidia.com/GPUGems/gpugems_ch32.html"]It's right smack in GPU gems, actually[/url].[/quote]

Thanks again! :)

I found it in the [url="http://developer.download.nvidia.com/cg/Cg_3.1/CgUsersManual.pdf"]Cg Users Manual[/url]. It is described in the section “Shared Parameters and Interfaces” at epic length.
I've been working with OpenGL for a while, recently got into GLSL, and have never even touched DirectX yet. Does Cg run pretty well with OpenGL? I think I remember seeing an extension or something for Cg. Is it a good idea to switch over to Cg and try to make use of that feature?

Knowing OpenGL seems very useful since it pretty much runs on every device I've tried, Windows, Mac, Linux, iPhone, Android...
[quote name='ill' timestamp='1335507175' post='4935306']Does Cg run pretty well with OpenGL? [...] Is it a good idea to switch over to Cg and try to make use of that feature?[/quote]

Yes, very much so. It can be pretty accurately described by the phrase 'HLSL for OpenGL,' in fact.
This reminds me of a blog post that I read a while ago about simulating closures in HLSL. As far as I remember it works on SM 3.0 and up.
[url]http://code4k.blogspot.com/2011/11/advanced-hlsl-using-closures-and.html[/url]
Wow, HLSL looks way better than GLSL. More like C++... What the hell... So Cg is similar to HLSL? I should switch to Cg...
When Cg and HLSL were first created they were pretty much identical. Since D3D10, however, they have diverged quite a bit.
[quote name='ill' timestamp='1335723403' post='4935861']So Cg is similar to HLSL?[/quote]HLSL and Cg are almost exactly the same language -- both Microsoft and nVidia cooperated to create a "[i]high level shading language[/i]" together. When they were done, MS published their version under the name "[i]DirectX HLSL[/i]" and nVidia published theirs under the name "[i]nVidia Cg[/i]", but they were both almost the same language (with different supporting tools/APIs/extensions). Edited by Hodgman