Oogst

Boolean operations in shader assembly

What is the best way to do boolean operations in shader assembly for DirectX shader model 3.0? So far the only way I have seen is to use floats cleverly to represent them. Is there something easier, more efficient, or more direct?
 
For example, if I want to do something like this:
 
if (a < b && (c < d || e < f))
{
    ...
}

I could break this down to roughly this (slightly simplified from assembly):
 
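if_lt a, b                // outer test: a < b
{
    sub t, c, d           // t = c - d
    cmp t, t, 0, 1        // t = (t >= 0) ? 0 : 1, so t = 1 when c < d
    sub s, e, f           // s = e - f
    cmp s, s, 0, 1        // s = 1 when e < f
    add t, t, s           // t > 0 when either comparison held (logical OR)
    if_gt t, 0.5
    {
        ...
    }
}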

Is there some better way to do this, or is this just how I am supposed to implement this?
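
To check that the float encoding above really matches the boolean expression, here is a minimal CPU-side sketch in Python. It is only an emulation, not shader code: the `cmp` helper is my own stand-in that follows the documented behaviour of the ps_3_0 `cmp` instruction (pick the second operand when the first is >= 0), and `condition` is a hypothetical name.

```python
# CPU-side emulation of the float-as-boolean pattern; this is not shader code.

def cmp(src0, src1, src2):
    """Stand-in for the ps_3_0 cmp instruction: src1 if src0 >= 0, else src2."""
    return src1 if src0 >= 0.0 else src2


def condition(a, b, c, d, e, f):
    """Evaluate a < b && (c < d || e < f) the way the assembly above does."""
    if not (a < b):               # if_lt a, b
        return False
    t = cmp(c - d, 0.0, 1.0)      # t = 1.0 when c < d
    s = cmp(e - f, 0.0, 1.0)      # s = 1.0 when e < f
    return (t + s) > 0.5          # OR: at least one flag is set
```

Since t and s are always exactly 0.0 or 1.0, the sum is 0, 1 or 2, and the 0.5 threshold just keeps the final comparison away from zero.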
 
(By the way, the reason I have recently started learning shader assembly is that I have a shader for which both Cg and HLSL require too many temporaries to run, while I know it can be done with fewer.)
Edited by Oogst

So far the only way I have seen is smartly using floats to represent them, is there something easier/more efficient/more direct?

If they're dynamic branches (decisions that have to be made per-pixel/vertex), then yeah, you have to use float because there are no integer/bool registers in SM3.

 

The best way to deal with branching is to avoid doing branching.

 

In your example, you end up using two different branches (one nested in the other). It would likely be much better to just use a single branch, due to how expensive they are on SM3 hardware...

 


By the way, the reason I have recently started learning shader assembly is that I have a shader for which both Cg and HLSL require too many temporaries to run, while I know it can be done with less

Does the compiler fail due to this, or the runtime?

 

Usually you can massage your HLSL code to produce better asm, rather than writing the asm yourself (e.g. by using the HLSL attributes such as branch, flatten, fastopt, forcecase, call, unroll, loop, isolate, or by writing more asm-like code).

e.g. to tell the compiler to use a series of cmp instructions and arithmetic for that branch, you could try something like:

[branch]
if( 0 < step(a,b) * (step(c,d) + step(e,f)) )
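
A quick way to sanity-check that expression is to emulate it on the CPU. The Python sketch below is only an illustration: `step` mirrors the HLSL intrinsic (1 when x >= y, else 0) and `branch_taken` is a hypothetical name. Note that step uses >=, so the emulated condition is a <= b && (c <= d || e <= f), which differs from the strict version only when two operands are exactly equal.

```python
# CPU-side emulation of the step()-based condition; this is not shader code.

def step(y, x):
    """Mirror of HLSL step(y, x): 1.0 when x >= y, otherwise 0.0."""
    return 1.0 if x >= y else 0.0


def branch_taken(a, b, c, d, e, f):
    """Multiplying 0/1 flags acts as AND, adding them acts as OR."""
    return 0.0 < step(a, b) * (step(c, d) + step(e, f))
```

The product is zero unless the first flag and at least one of the other two are set, so a single comparison against zero decides the whole condition.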
Edited by Hodgman


 

...you have to use float because there are no integer/bool registers in SM3.

 

 

Then WTF is these used for?

 

As noted on the page that explains how to use those (http://msdn.microsoft.com/en-us/library/windows/desktop/bb174580(v=vs.85).aspx), that's for static flow control, i.e. flow control that is not determined at shader execution time.

Yeah I meant actual dynamic variables in the shader functions (i.e. non-uniform ones).
You can have 16 'constant' (uniform) bools and 16 ints, but these types don't really exist at runtime: there are no instructions to operate on them. If you want to work with them, you'll be copying the results into temporary float registers.

As Washu mentions above, the 16 bool constants are generally used as a 16-bit mask that controls static branching.


OK. If you don't mind, I would like to reuse this topic for some questions.

1. In some places I would like to avoid switching shaders, and I see that a static bool (constant during shader execution but different per draw call) costs 1 instruction according to the docs. Is it cheaper than switching shaders, and is it cheaper than dynamic branching?

2. I have this situation for parallel split shadow maps. Which of these is cheaper?

this:

if(posVS.z > SplitDist.x && posVS.z <= SplitDist.y) // if in split range
{ ...lots_of_calculus() }

 

or this:

clip(posVS.z - SplitDist.x);
clip(SplitDist.y - posVS.z);
...lots_of_calculus()
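
Before comparing cost, it may help to note that the two variants are not quite the same test. The Python sketch below is a CPU-side emulation under two assumptions: that clip(x) discards the pixel when x is less than zero, as the HLSL docs describe, and that SplitDist.x and SplitDist.y are the near and far split distances. The function names are hypothetical.

```python
# CPU-side emulation of the two range tests; this is not shader code.

def in_split_range(z, split_near, split_far):
    """The if() variant: strictly greater than near, at most far."""
    return split_near < z <= split_far


def survives_clip(z, split_near, split_far):
    """The clip() variant: a pixel is discarded when either argument is < 0."""
    discarded = (z - split_near) < 0.0 or (split_far - z) < 0.0
    return not discarded
```

The two agree everywhere except at z equal to the near split distance, where clip keeps the pixel and the if skips the work. Also keep in mind that clip throws the whole pixel away, while the if only skips the calculation.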

Thanks folks, I'll just work with the floats then!
 

1. In some places I would like to avoid switching shaders, and I see that a static bool (constant during shader execution but different per draw call) costs 1 instruction according to the docs. Is it cheaper than switching shaders, and is it cheaper than dynamic branching?

This cannot be answered in the general case, because the shader operates per pixel (if it is a pixel shader), and the shader switching is per object. It is impossible to compare the two in anything but a specific situation, because the scene might be a few really large objects (tons of pixels, few shader switches) or a ton of small objects (few pixels, lots of shader switches).

Also, how long shader switching takes depends on the CPU, GPU, DirectX version and drivers...
Edited by Oogst

While we are on the topic of shader assembly: is there a shader instruction equivalent to the Cg/HLSL function saturate(x)? This function limits a value to the range [0, 1]. Since it is an intrinsic function in Cg and HLSL (GLSL expresses the same thing as clamp(x, 0.0, 1.0)), I expected to see it in shader assembly as well, but I couldn't find it...


Saturate is supported as a modifier for certain instructions. So if you do something like x = saturate(y * z), you'll end up with a mul_sat instruction in your assembly.

This actually maps to how GPUs implement saturate in their native microcode.
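
In other words, the modifier behaves as if the clamp were folded into the instruction that produces the value. A minimal Python emulation of that idea, with `mul_sat` as a hypothetical helper named after the instruction (this is a sketch of the semantics, not of how any GPU implements it):

```python
# CPU-side emulation of an instruction with the _sat modifier; not shader code.

def saturate(x):
    """Clamp x to the range [0, 1], like the HLSL saturate intrinsic."""
    return min(max(x, 0.0), 1.0)


def mul_sat(y, z):
    """Multiply, then clamp the result, as mul with the _sat modifier would."""
    return saturate(y * z)
```

So x = saturate(y * z) needs no separate clamp step after the multiply.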

Edited by MJP


In some places I would like to avoid switching shaders, and I see that a static bool (constant during shader execution but different per draw call) costs 1 instruction according to the docs. Is it cheaper than switching shaders, and is it cheaper than dynamic branching?

Everything to do with performance characteristics is implementation defined, so you'll have to profile to get answers ;) but...
There are two main implementation options the driver could use:
1) It internally performs a shader switch for you. Basically, it takes your supplied shader code, finds all the permutations based on the static branches, and internally creates one compiled shader program for each permutation. Before each draw call, it checks the values of the 16 booleans to pick the appropriate shader code.
In this case, it's the same as if you implemented your own shader permutation system. Switching shaders is basically free, as long as the previous draw-call covered a few hundred pixels.
2) It leaves the branch in there, performing it per-pixel.
In this case, you're probably going to burn a bunch of cycles per pixel in exchange for the convenience of not having to switch shaders. It will likely be faster than a dynamic branch (e.g. branching on the results of some float computations) by a good amount -- e.g. if a dynamic branch instruction takes a dozen cycles to complete, a static branch instruction might take half a dozen cycles...
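
To make option 1 a bit more concrete, here is a rough Python sketch of a permutation cache keyed by the packed bool constants. Everything in it is hypothetical (the class name, the compile callback, the program names); it only illustrates the idea of picking a precompiled variant per draw call and says nothing about what any real driver does internally.

```python
# Illustrative sketch of a shader permutation cache; all names are made up.

class PermutationCache:
    def __init__(self, compile_variant):
        # compile_variant is an assumed callback: integer mask -> compiled program
        self._compile_variant = compile_variant
        self._programs = {}

    @staticmethod
    def pack(bools):
        """Pack up to 16 booleans into an integer mask, bit i for constant i."""
        if len(bools) > 16:
            raise ValueError("only 16 bool constants are available")
        mask = 0
        for i, flag in enumerate(bools):
            if flag:
                mask |= 1 << i
        return mask

    def select(self, bools):
        """Return the variant for these flags, compiling it on first use."""
        mask = self.pack(bools)
        if mask not in self._programs:
            self._programs[mask] = self._compile_variant(mask)
        return self._programs[mask]


compiled = []  # records which masks actually triggered a compile


def fake_compile(mask):
    compiled.append(mask)
    return f"program_{mask:04x}"


cache = PermutationCache(fake_compile)
cache.select([True, False, True])   # first use: compiles the variant
cache.select([True, False, True])   # second use: served from the cache
```

The per-draw cost in this scheme is a mask computation and a dictionary lookup, which is the same work you would do in your own permutation system.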
 

I just checked, but saturate as a modifier is only available from shader model 4, while I am using shader model 3. According to documentation: http://msdn.microsoft.com/en-us/library/windows/desktop/hh447231(v=vs.85).aspx
 
The saturate function in HLSL has been around since shader model 1.

That's really weird, because if I look at the asm output that I'm getting from my compiled SM3 code, it does include instructions like mul_sat (which aren't listed on the MSDN instruction reference for SM3...).
 
The MSDN also shows that the _sat modifier did exist in SM1...

[edit] The ps_2/ps_3 modifiers are documented here (and for vs_3 here). Mystery solved. That page that says that the modifier is only available in SM4+ is just wrong :/

Edited by Hodgman


Ok, thanks. I just wanted to know on what to base my choices.

Now I have one more question, if you don't mind. Do some intrinsic functions, like min, max and saturate, result in branching? I don't see how they can be done any other way without checking the input variable.


Things like min/max/saturate do not branch. They are extremely simple things that the hardware can just do directly, no need to jump to a different spot in code for that.


1. In some places I would like to avoid switching shaders, and I see that a static bool (constant during shader execution but different per draw call) costs 1 instruction according to the docs. Is it cheaper than switching shaders, and is it cheaper than dynamic branching?

 

Just a word of warning about this line of thinking.

 

On the surface it looks like "it's just one extra instruction, I'll eat it, it's no bother".

 

It's not that simple.  Assuming that this shader is going to cover every pixel in your window, assuming that you have perfect overdraw elimination (hint: you don't), and assuming a 1600x900 resolution, it's actually just under 1.5 million extra instructions.  That's what you should be comparing the cost of a shader switch against.
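
The arithmetic behind that figure, as a small Python sketch. The function name and the overdraw parameter are my own additions for illustration; the 1600x900 resolution and the single extra instruction come from the paragraph above.

```python
# Back-of-the-envelope cost of extra per-pixel instructions per frame.

def extra_instructions(width, height, extra_per_pixel=1, overdraw=1.0):
    """Extra instructions per frame; overdraw=1.0 means every pixel is shaded once."""
    return int(width * height * extra_per_pixel * overdraw)


full_screen = extra_instructions(1600, 900)                    # 1,440,000
with_overdraw = extra_instructions(1600, 900, overdraw=2.5)    # assumed overdraw factor
```

Any overdraw above 1.0 scales the number up proportionally; the 2.5 here is just an assumed value to show the effect.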


Do some intrinsic functions, like min, max and saturate, result in branching? I don't see how they can be done any other way without checking the input variable.

No, the instruction set will contain instructions for performing those operations without branching... or in other words, any branching that is required internally by those algorithms is embedded into the silicon and doesn't count.
e.g. when you write a + b, the hardware might have to make a bunch of decisions based on the sign of a and b (i.e. to actually perform subtraction), but all of that logic is embedded right into the addition hardware, so it all gets done in a single clock cycle.
The logic for min/max/saturate/etc is also built right into the hardware, so no branching of the code is required.

Also, note that there's a lot of other things that you can do in shaders without branching, which you'd traditionally use if statements for in CPU-side code.
e.g. instead of:
if( powerupAmount >= 0.5 ) color = yellow;
else color = white;
You can actually perform that kind of selection without a branch:
color = powerupAmount >= 0.5 ? yellow : white;
// in pseudo asm (cmp picks its second operand when the first is >= 0):
// sub temp, powerupAmount, 0.5
// cmp color, temp, yellow, white
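
The same two-instruction selection can be emulated on the CPU to see that it matches the if/else. This Python sketch is only an illustration: `cmp` is a stand-in for the asm instruction, and the colour tuples are assumed RGB values I picked for the example.

```python
# CPU-side emulation of the branchless select; this is not shader code.

YELLOW = (1.0, 1.0, 0.0)   # assumed RGB values for illustration
WHITE = (1.0, 1.0, 1.0)


def cmp(src0, src1, src2):
    """Stand-in for the cmp instruction: src1 if src0 >= 0, else src2."""
    return src1 if src0 >= 0.0 else src2


def select_color(powerup_amount):
    temp = powerup_amount - 0.5        # sub temp, powerupAmount, 0.5
    return cmp(temp, YELLOW, WHITE)    # cmp color, temp, yellow, white
```

Both operands are always available, and the instruction just picks one, so there is no jump in the instruction stream.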
Edited by Hodgman


Just a word of warning about this line of thinking.

 

On the surface it looks like "it's just one extra instruction, I'll eat it, it's no bother".

 

It's not that simple.  Assuming that this shader is going to cover every pixel in your window, assuming that you have perfect overdraw elimination (hint: you don't), and assuming a 1600x900 resolution, it's actually just under 1.5 million extra instructions.  That's what you should be comparing the cost of a shader switch against.

 

It is indeed a lot of instructions, but also keep in mind that a modern video card happily does much more. My 3-year-old video card easily does a 1500-instruction post effect at 1920x1200. That means a whopping 3,456,000,000 instructions per frame for just that post effect, and the card easily does this at well above 60 fps. So in comparison to what modern video cards can do, 1.5 million instructions is peanuts...

 

Which is no reason to just throw away performance, of course. :)
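
For anyone who wants to check the numbers, here is the comparison as a few lines of Python. It only uses the figures quoted in this thread; the variable names are mine.

```python
# Comparing the two per-frame instruction counts mentioned in the thread.

post_effect = 1920 * 1200 * 1500   # 1500-instruction post effect at 1920x1200
static_branch = 1600 * 900 * 1     # one extra instruction per pixel at 1600x900

ratio = post_effect // static_branch   # how many times larger the post effect is
print(f"post effect: {post_effect:,} instructions per frame")
print(f"extra branch instruction: {static_branch:,} instructions per frame")
print(f"the post effect is {ratio}x larger")
```

So the extra instruction is a tiny fraction of what that one post effect costs, assuming all instructions cost about the same, which is a simplification.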

