Shader clock cycles & optimizations

Hi, I was looking around a bit at shader performance. Obviously, the fewer operations the GPU has to execute, the better. Now I was checking the assembly code from FX Composer. I don't know much about optimization, so I tested a simple line:

// 1.
float3 value = tex2D( tex, texcoord );
       value = 2 * value -1;
// Compiled, 3 operations:
def c0, -1,-1,-1,-1
tex t0
add r0, t0, t0
add r0, r0, c0


// 2.
float3 value = tex2D( tex, texcoord );
       value = 0.5 * value -1;
// Compiled, 2 operations:
def c0, 0.5,0.5,0.5,0.5
def c1, -1,-1,-1,-1
tex t0
mad r0, t0, c0, c1
The second formula takes only 2 operations, but I would say formula 1 is simpler. MAD is a combination of 2 instructions, so I suppose it's slower than an ADD, for example. Is there an overview somewhere of how many cycles each instruction takes? I looked into the FX Composer Perf tool, but it says "cycles: 1.00" for both formulas.

Talking about optimizations. I compared 2 other lines:

float3 normal = 2 * tex2D( normalMap, texcoord ) -1;
float3 normal = 2 * ( tex2D( normalMap, texcoord ) -0.5 );
I would say the first line is faster, since I'm working with integer numbers. But the second line took 2 instructions instead of 3:

tex t0
add r0, t1_bias, t1_bias

t0_bias? Doesn't that need to be calculated somewhere first?

Greetings,
Rick
Hi Rick,

Most instructions take only one clock cycle. Macro instructions are composed of multiple instructions, so they take longer. For example, m3x4 is actually four dp3 instructions, so it takes four clock cycles. And pow(x, y) is actually exp(mul(y, log(x))), so it's three clock cycles. Likewise, nrm is a dp3, an rsq and a mul. The mad instruction appears to be two instructions, but it's actually one. The 'instruction slots' listed in the DirectX SDK documentation give a good idea of which instructions take longer because they are broken into smaller instructions.
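
For example, here is a rough Cg/HLSL sketch of those expansions written out by hand (the intrinsics are what you'd normally use; this is just to show what the macro instructions boil down to):

// pow: log -> mul -> exp, three instruction slots (assumes x > 0)
float MyPow(float x, float y)
{
    return exp2(y * log2(x));
}

// nrm: dp3 -> rsq -> mul, three instruction slots
float3 MyNormalize(float3 v)
{
    return v * rsqrt(dot(v, v));
}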

But there are exceptions. Newer graphics chips often have multiple execution units per pixel pipeline, and can have specialized units as well. For example, nrm takes only two clock cycles on an NVIDIA GeForce Series 6/7 because it has two cascaded execution units, and nrm_pp takes only one clock cycle because it has its own execution unit. Also, modifiers like _bias come for free. On the GeForce Series 7, each execution unit can do a full mad instruction, so in most situations a mad counts as only 0.5 clock cycles.
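
As a sketch of how you can take advantage of that from Cg/HLSL (reusing the sampler and coordinate names from the earlier snippets): declaring a value as half lets the compiler pick the _pp variants, so normalize() can end up on that dedicated nrm_pp unit:

half3 FetchNormal(sampler2D normalMap, float2 texcoord)
{
    half3 n = tex2D(normalMap, texcoord).xyz * 2 - 1;  // half precision
    return normalize(n);                               // can compile to the one-cycle nrm_pp
}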

Detailed reviews on sites like Beyond3D.com can give you some insight into the chip architectures. And using tools like FX Composer can reveal how many clock cycles a shader takes in practice.

Hope this helps,

Nick
Thanks!

But what still puzzles me is the difference between the 2 ADDs and the 1 MAD. Composer said they both need 1 cycle, but from a listing I saw, 1 ADD is also 1 cycle. So the 2 ADDs should be slower, I think. That's a little strange as well, since it would mean the compiler chose the wrong optimization in that case.

I'm trying to write my Cg shaders a bit more properly. I noticed that sometimes a shader can be faster (fewer operations/cycles) simply by putting a certain line somewhere else. I guess this has to do with the registers that are available. Could someone give a couple of good tips for writing code in a high-level language such as Cg?

Greetings,
Rick
Quote:Original post by spek
But what still puzzles me is the difference between the 2 ADDs and the 1 MAD. Composer said they both need 1 cycle, but from a listing I saw, 1 ADD is also 1 cycle. So the 2 ADDs should be slower, I think. That's a little strange as well, since it would mean the compiler chose the wrong optimization in that case.

If you have a GeForce Series 7 card, then both execution units can do MAD.

So they can do 2 ADDs in one clock cycle, in parallel. If your shader does only one ADD, it just means the second shader unit isn't used, and it still takes one whole clock cycle to execute that one ADD. So the 0.5 clock cycle figure is an abstraction that only applies to pairs of instructions that can execute in parallel.
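
As an illustration (with made-up inputs), two independent ADDs like these can be packed onto the two shader units in the same cycle, while a third ADD that depends on both results has to wait for the next one:

float3 SumLighting(float3 diffuse, float3 ambient, float3 specular, float3 emissive)
{
    float3 a = diffuse  + ambient;   // ADD on shader unit 1
    float3 b = specular + emissive;  // ADD on shader unit 2, same cycle
    return a + b;                    // third ADD, next cycle
}
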
Quote:I'm trying to write my Cg shaders a bit more properly. I noticed that sometimes a shader can be faster (fewer operations/cycles) simply by putting a certain line somewhere else. I guess this has to do with the registers that are available. Could someone give a couple of good tips for writing code in a high-level language such as Cg?

In theory, the driver should optimize both versions the same way if they are equivalent. Unfortunately, that isn't entirely the case in practice yet. What's even worse is that we have no control over it, and it can change with newer drivers (hopefully for the better).

So if performance is of significant importance, try writing your shaders in assembly. If it's of lesser importance, just use Cg and write clean code instead of trying to manipulate the code produced by the driver.

Edit: corrected some confusion about which execution unit can handle which instruction. Thanks to Eric Lengyel.

Try searching for "The COMPLETE HLSL Reference" on Google; it's quite a handy PDF to keep around for reference.
Keep in mind that the assembly you see isn't necessarily the same as what gets executed on the GPU. The driver will still change some instructions and reorder things a bit to pack instructions into the available slots for each cycle. There are also a bunch of other things going on behind the scenes, such as perspective-correction operations when you access an interpolant.

The GF6 and GF7 both have two arithmetic units and a texture unit that execute instructions in a cascaded manner each cycle in the order ALU1 -> TEX -> ALU2. On a GF6, the first arithmetic unit can perform MOV, MUL, RCP, and DIV only. On a GF7, it can also perform ADD, MAD, and RSQ. (RSQ is broken on GF6 and is emulated in two cycles with LG2/EX2.) The second arithmetic unit can perform everything except RCP, DIV, and RSQ on both chips. A half-precision NRM.xyz instruction executes on a dedicated unit in parallel to the texture unit. Result operations such as scaling by 2, 4, 8, 1/2, 1/4, or 1/8, saturate, and signed saturate are free. The negation, absolute value, and some clamping operations on input values are free, as is the x*2-1 operation on a texture result.
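
To make that concrete, here is a small sketch of source patterns that should map onto those free operations (the sampler and input names are placeholders, not anything from a real shader in this thread):

sampler2D normalMap;
float3 lightDir;
float3 lightColor;

float4 UnpackAndLight(float2 texcoord : TEXCOORD0) : COLOR
{
    float3 n = tex2D(normalMap, texcoord).xyz * 2 - 1;  // free x*2-1 on a texture result
    float  d = saturate(dot(n, lightDir));              // saturate is a free result operation
    float3 c = lightColor * (d * 2);                    // scale by 2 is a free result operation
    return float4(c, 1);
}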

This is just a very simple overview of the fragment shader functionality. The details are far more complex and involve things like register read and write ports and inter-unit bandwidth, but I'm not allowed to talk about those.
Quote:Original post by Eric Lengyel
On a GF6, the first arithmetic unit can perform MOV, MUL, RCP, and DIV only. On a GF7, it can also perform ADD, MAD, and RSQ. (RSQ is broken on GF6 and is emulated in two cycles with LG2/EX2.) The second arithmetic unit can perform everything except RCP, DIV, and RSQ on both chips.

Just out of interest, where did you get this information? I know about the two execution units and roughly which instructions they can handle, but not details like RSQ being broken on Series 6 (is it truly broken, or did they decide to save transistors by handling it as x^-1/2 via LG2/EX2)?

Quote:Original post by C0D1F1ED
Quote:Original post by Eric Lengyel
On a GF6, the first arithmetic unit can perform MOV, MUL, RCP, and DIV only. On a GF7, it can also perform ADD, MAD, and RSQ. (RSQ is broken on GF6 and is emulated in two cycles with LG2/EX2.) The second arithmetic unit can perform everything except RCP, DIV, and RSQ on both chips.

Just out of interest, where did you get this information? I know about the two execution units and roughly which instructions they can handle, but not details like RSQ being broken on Series 6 (is it truly broken, or did they decide to save transistors by handling it as x^-1/2 via LG2/EX2)?


A while ago, I did an obscene amount of reverse engineering of the NV4x driver. I can see and disassemble the hardware command buffer, and more pertinently to this thread, the actual machine instructions for vertex programs and fragment programs. The instruction word layouts are virtually the same on GF6 and GF7, but the opcode for RSQ is never used on GF6 -- the driver always generates LG2/EX2 instead, so I'm assuming that RSQ is broken. I was able to learn enough through reverse engineering that I wrote a low-level RSX driver for the PS3 which is now in use by Sony first-party game studios. Later, I was commissioned to do a study that involved the internal workings of the RSX fragment program hardware, so I had access to internal Nvidia documentation. I'm not allowed to talk about what I saw in those documents until 2016, but I can tell you things I learned through reverse engineering.
Quote:Original post by Eric Lengyel
I was able to learn enough through reverse engineering that I wrote a low-level RSX driver for the PS3...


I *BOW* To you, master Lengyel.

Seriously, my jaw literally dropped when I read this post.
Quote:Original post by spek
Talking about optimizations. I compared 2 other lines:
float3 normal = 2 * tex2D( normalMap, texcoord ) -1;
float3 normal = 2 * ( tex2D( normalMap, texcoord ) -0.5 );

I would say the first line is faster, since I'm working with integer numbers.


Just to pick up on that: While integer math may be faster than float math on the CPU, it's not generally the case on the GPU. Remember that GPUs are designed to handle lots of floating-point data and that integer registers are a relatively recent addition.

Quote:But the second line took 2 instructions instead of 3:
tex t0
add r0, t1_bias, t1_bias

t0_bias? Doesn't that need to be calculated somewhere first?


_bias is a modifier, as C0D1F1ED said, which is just hardwired to subtract 0.5 from its argument. It comes for free.

Richard "Superpig" Fine - saving pigs from untimely fates - Microsoft DirectX MVP 2006/2007/2008/2009
"Shaders are not meant to do everything. Of course you can try to use it for everything, but it's like playing football using cabbage." - MickeyMouse
