Shaders, Registers, and You

In HLSL you can bind a boolean to a specific register, and in doing so you can make conditionals much faster.

bool g_bGoochLighting : register( b10 ) = false;


Does such torment exist in GLSL?
If not, what kind of behavior should I expect from conditionals?
In HLSL, both sides of an if/else are fully evaluated and the results are then interpolated based on which condition was supposed to succeed.
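As I understand it, a flattened branch effectively behaves like this sketch (the lighting values are just placeholders I made up):

// Conceptual sketch: both sides are evaluated, then the condition
// selects (lerps) between the two results.
float4 Flattened( float4 vColor, bool bGooch )
{
    float4 vGooch = vColor * float4( 0.6f, 0.6f, 0.2f, 1.0f );  // "then" side
    float4 vPhong = vColor * float4( 0.8f, 0.8f, 0.8f, 1.0f );  // "else" side
    return lerp( vPhong, vGooch, bGooch ? 1.0f : 0.0f );        // select by condition
}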


Thank you,
Yogurt Emperor

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

Quote:Original post by YogurtEmperor
Does such torment exist in GLSL?
If not, what kind of behavior should I expect from conditionals?

I would expect nothing; it depends on the GPU and driver version. When you're coding in C++, do you ever think about which local variable will be loaded into a register? ;-)


Quote:Original post by YogurtEmperor
In HLSL, both sides of an if/else are fully evaluated and the results are then interpolated based on which condition was supposed to succeed.

An optimizer might optimize it this way, but it depends. On older hardware without branching capability this is most probably the way to go; on newer hardware it depends on which way will be faster.
IIRC any if statement based on a uniform bool value in HLSL results in a "static branch", where all pixels being processed in parallel are guaranteed to take the same branch, so there's less of a performance hit. Any other if statements (besides compile-time ones) become dynamic branches, which are much slower. You can also use [hints] to ask the compiler to either use an actual branch instruction or the old behaviour of evaluating both sides and lerping them to get the correct result.
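For example (the shader body below is just illustrative, but [branch] and [flatten] are the actual HLSL attributes):

// The attributes are HLSL's flow-control hints; the shader itself is made up.
bool g_bGoochLighting : register( b10 );

float4 psMain( float4 vColor : COLOR0 ) : COLOR0
{
    float4 vResult = vColor;

    [branch]    // ask for an actual branch instruction (static branch on a uniform bool)
    if ( g_bGoochLighting )
    {
        vResult *= float4( 0.6f, 0.6f, 0.2f, 1.0f );
    }

    [flatten]   // ask for the old evaluate-both-sides-and-select behaviour
    if ( vResult.a < 0.5f )
    {
        vResult *= 0.5f;
    }

    return vResult;
}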

However, I don't think the GLSL specification goes into any of these details at all (in fact, it barely describes flow control) -- it seems it's pretty much up to the driver/chipset manufacturer as to how things will behave.
If you're targeting desktop PCs, then I'd assume pretty similar behaviour to HLSL (i.e. branching on a uniform bool will be fairly fast), but if you're targeting something else, like an iPhone, then I'd make no assumptions... :(
Thank you.

I am targeting all of the above and more, as my shader foundation will have to support Nintendo Wii, PlayStation 3, Macintosh, PC, Xbox 360, and iPhone/iPod/iPad.


In finding the lowest common denominator among these systems, it seems I cannot have register specification in my shader language, but I suppose I can have hints as in HLSL; they would be used in HLSL and ignored in GLSL.
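Something along these lines, perhaps (the macro and define names are just my own invention for illustration):

// Hypothetical wrapper macro: expands to HLSL's hint when compiling as
// HLSL, and to nothing when the source is emitted as GLSL.
#ifdef TARGET_HLSL
    #define LSE_BRANCH [branch]
#else
    #define LSE_BRANCH
#endif

bool g_bGoochLighting;

float4 Shade( float4 vColor )
{
    LSE_BRANCH if ( g_bGoochLighting )
    {
        vColor *= float4( 0.6f, 0.6f, 0.2f, 1.0f );
    }
    return vColor;
}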


Thank you again.


L. Spiro


As far as I know, there are no commercially available high-level shader compilers for the Wii yet - if you're making one it'd be a good piece of middleware ;)
It's my understanding that branching is not nearly as simple as it looks on GPUs. I was actually talking about this with nVidia the other day. A bunch of what was discussed is under NDA, but the more general stuff is probably OK to pass on.

The way the processors work in current generations is that when a branch is hit, threads of execution in one core which would go one way are suspended. The other threads run to completion.

At which point the core goes back and collects branched threads and runs them to completion. This operates recursively -- there's a stack of "suspended threads" and their restart program counter.

But there's only ONE program counter for all the threads which are running. Which is a bit of an eye-opener, and really puts the cost of branching into a whole new light -- nVidia's recommendations are to do redundant maths and throw values away rather than branching if at all possible. Maths is cheap, branching is not.

Now whether this magic boolean register is connected to that at some level, I don't know.

But from the info I've been given, branching is *fundamentally* really, really bad, and if you want to do something like light calculations which you can turn on and off, a better option may be to actually do the work and then simply multiply the result by an input uniform (which you can set to 1 or 0).
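Something like this, as a sketch (the names are made up):

// Branchless sketch: always do the lighting maths, then scale the result
// by a uniform the application sets to 1.0 (on) or 0.0 (off).
float g_fGoochEnabled;

float4 Shade( float4 vBase, float4 vGooch )
{
    return vBase + vGooch * g_fGoochEnabled;
}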

{This, of course, is for PC GPUs. I've no idea what console ones look like inside.}

Well, what Katie said pretty much matches how I thought it'd work. I've read before that GPUs are SIMD, which is how they manage to process lots of data without some really expensive hardware. But this also means no proper branching, so I figured they probably found some way to ignore processing on threads where the wrong code would run when a branch is "taken".

Notice this also means that there is not only a single program counter, but probably also a single instruction decoder and such. In fact, the ALUs and where results are stored are probably the only things that aren't shared - and I'm inferring all this from the description of SIMD, so feel free to correct me if I'm wrong =P
Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.
This Microsoft publication has some great details on the different kinds of flow-control implementations on GPUs.

Branching on a uniform bool is much faster than branching on other variables because the GPU knows it can schedule all the pixels in that draw-call to take the same path. In this case, branching can sometimes be an optimisation, but usually it's just evil ;)

