
Are GPU drivers optimizing pow(x,2)?


15 replies to this topic

#1 CryZe   Members   -  Reputation: 768


Posted 09 October 2012 - 05:23 AM

The power function is usually about 6 times slower than a simple MAD instruction (at least on current NVIDIA GPUs). The HLSL compiler itself doesn't optimize it and simply converts the pow into a LOG, a MUL and an EXP instruction. But most constant powers up to x^32 would actually be faster to calculate using just MUL instructions. Now the question is: Should I bother optimizing it myself, or do the drivers usually optimize something like this? Here is the function I would use if I had to optimize it myself:

float constpow(float x, uint y)
{
	if (y == 0)
		return 1; //Cost 0

	if (y == 1)
		return x; //Cost 0

	float x2 = x * x; //Cost 1

	if (y == 2)
		return x2; //Cost 1

	if (y == 3)
		return x2 * x; //Cost 2

	float x4 = x2 * x2; //Cost 2

	if (y == 4)
		return x4; //Cost 2

	if (y == 5)
		return x4 * x; //Cost 3

	if (y == 6)
		return x4 * x2; //Cost 3

	if (y == 7)
		return x4 * x2 * x; //Cost 4

	float x8 = x4 * x4; //Cost 3

	if (y == 8)
		return x8; //Cost 3

	if (y == 9)
		return x8 * x; //Cost 4

	if (y == 10)
		return x8 * x2; //Cost 4

	if (y == 11)
		return x8 * x2 * x; //Cost 5

	if (y == 12)
		return x8 * x4; //Cost 4

	if (y == 13)
		return x8 * x4 * x; //Cost 5

	if (y == 14)
		return x8 * x4 * x2; //Cost 5

	float x16 = x8 * x8; //Cost 4

	if (y == 16)
		return x16; //Cost 4

	if (y == 17)
		return x16 * x; //Cost 5

	if (y == 18)
		return x16 * x2; //Cost 5

	if (y == 20)
		return x16 * x4; //Cost 5

	if (y == 24)
		return x16 * x8; //Cost 5

	if (y == 32)
		return x16 * x16; //Cost 5

	return pow(x, y);
}

If the drivers do this themselves, it would probably be better to just leave the pow(x, y) in place, because they know better when to optimize it. I'd obviously only use this when y is a compile-time constant; I don't want any dynamic branching here.
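For reference, this is roughly how I'd call it — just a sketch with made-up names. The point is that y is a literal, so the compiler should be able to fold every branch away and leave only the MUL chain:

// Usage sketch (illustrative names only). With a literal exponent, all the
// comparisons in constpow are compile-time constants, so dead-code elimination
// should leave just: x2 = x*x; x4 = x2*x2; x8 = x4*x4; result = x8*x8.
float3 ShadeSpecular(float3 lightColor, float NdotH)
{
	float spec = constpow(saturate(NdotH), 16);
	return lightColor * spec;
}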

Edited by CryZe, 09 October 2012 - 05:36 AM.



#2 bwhiting   Members   -  Reputation: 814


Posted 09 October 2012 - 05:39 AM

I guess the best thing to do would be to run some tests.

Why don't you render some full-screen quads using each method and use something like PIX or PerfHUD to see how long each one ends up taking? Shouldn't take too long.

(I'm only saying this because I don't know the answer to your question.) I wouldn't have thought they would optimize in that way, and in a number of cases you will know the power, so you can write it out with MULs on a case-by-case basis.
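Something along these lines might work as a starting point — just a sketch, with the texture read in there so the compiler can't constant-fold the whole expression away:

// Benchmark sketch: render a full-screen quad with each shader and compare
// timings in PIX/PerfHUD. Names are illustrative.
Texture2D InputTex;

float PS_Pow(float4 pos : SV_Position) : SV_Target0
{
	float x = InputTex[pos.xy].x;
	return pow(x, 16.0f);
}

float PS_Mul(float4 pos : SV_Position) : SV_Target0
{
	float x  = InputTex[pos.xy].x;
	float x2 = x * x;
	float x4 = x2 * x2;
	float x8 = x4 * x4;
	return x8 * x8; // x^16 via four MULs
}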

#3 mhagain   Crossbones+   -  Reputation: 8277


Posted 09 October 2012 - 05:54 AM

A branching version is quite likely to run even slower than just calling pow directly (especially so with GPU code where not all pixels in a group may take the same path), so I wouldn't even contemplate doing it that way. If you know that your exponent is always going to be in a certain range you may be able to encode it into a 1D texture, but I haven't benchmarked that against just using pow so I can't say anything regarding performance comparisons.
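Roughly along these lines, if you did want to try it — just a sketch (untested, SM4+ syntax), with the LUT filled on the CPU with pow(x, n) for x in [0, 1] and a fixed exponent n:

// Sketch only (untested). 1D lookup table holding pow(x, n) for a fixed
// exponent n, sampled with x in [0, 1]. For a varying exponent you would
// need a 2D LUT indexed by (x, n) instead.
Texture1D<float> PowLUT;
SamplerState LinearClamp;   // clamp addressing, linear filtering

float PowFromLUT(float x)
{
	return PowLUT.SampleLevel(LinearClamp, saturate(x), 0);
}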

It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.


#4 Ashaman73   Crossbones+   -  Reputation: 7991


Posted 09 October 2012 - 05:59 AM

Now the question is: Should I bother optimizing it myself, or do the drivers usually optimize something like this?

This is the old 'oh, if I write this cool, really fast asm, I can beat the compiler' discussion. The general consensus is that the compiler will most likely be better than your hand-written optimization, maybe not today, but tomorrow. Especially if you consider different platforms and future driver and hardware updates.

#5 Lightness1024   Members   -  Reputation: 739


Posted 09 October 2012 - 06:38 AM

If you compile with fxc you can output the intermediate DirectX assembly as a text file and just check what pow(x, 2) was transformed into.
Be careful: it strongly depends on the shader profile you target. For example, sin(x) is expanded into 4 MADs in shader model 1, but from shader model 2 onwards it uses sincos(x), since that is an intrinsic. There are similar differences with x*2, which is often implemented as x+x, but it varies according to profile.
Also, always keep in mind that there is a second compilation stage in the driver.
Most likely pow(x, 2) will already be turned into cheaper ALU instructions by the fxc compiler itself.
Like previously mentioned, test performance in practice to be sure, but I have a feeling it will be difficult to see any difference, because the noise might be greater than the difference in that case.
Don't forget to post results here ^^
Cheers
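For example, something like this (a rough sketch — the file and entry point names are just placeholders, and the texture read is only there so nothing gets constant-folded):

// test.hlsl -- compile with, for example:
//   fxc /T ps_5_0 /E PSMain /Fc listing.asm test.hlsl
// then open listing.asm and check whether pow(x, 2) turned into a mul.
Texture2D InputTex;

float PSMain(float4 pos : SV_Position) : SV_Target0
{
	float x = InputTex[pos.xy].x;
	return pow(x, 2.0f);
}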

#6 Hodgman   Moderators   -  Reputation: 31822


Posted 09 October 2012 - 06:39 AM

A branching version is quite likely to run even slower than just calling pow directly

True that, however, the OP is only planning on using that function with constant literal arguments, and hoping that the HLSL compiler then goes ahead and evaluates the branches at compile-time.

If you compile with fxc you can output the intermediate DirectX assembly as a text file and just check what pow(x, 2) was transformed into.

The OP mentioned that they did that, and it wasn't optimised. They're wondering if the D3D ASM -> GPU ASM compilation process done by your driver will actually perform this optimisation or not.

The general consensus is that the compiler will most likely be better than your hand-written optimization, maybe not today, but tomorrow.

I actually trust my HLSL compiler more than my C++ compiler for a lot of things, but they're still very dumb sometimes.
I've gotten a full 2x boost by 'massaging' my HLSL code into something uglier that the compiler could digest more easily... Actually, we wouldn't have been able to ship our last game running at 30Hz on the min spec if we hadn't hand-optimized all the HLSL code to do things that I originally assumed the compiler would be smart enough to do.

That said, the rules for "fast HLSL" change from time to time -- if you're targeting a DX9-era card, you want to hand-vectorize your code to use all 4 components of a float4 wherever possible to reduce instruction counts (FXC does actually do a good job of auto-vectorizing, but not as good as a human), but if you're targeting a DX10-era card, you want to mask off as few elements as possible (e.g. if you only need xyz, make sure to use a float3, or a float4 with .xyz on the end) and not be afraid of scalar math.
So, you can help the compiler to produce much faster code, but you do also need to know which kind of GPU architecture you're targeting while micro-optimizing your HLSL code.
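As a contrived illustration (not real production code — just the flavour of the two styles), the same attenuation math could be written either way:

// DX9-era style: hand-vectorized so four lights' worth of work fits in float4 ops.
float4 AttenuationVec(float4 dist4, float4 invRadius4)
{
	return saturate(1.0 - dist4 * invRadius4);   // one vectorized mad-style op plus saturate
}

// DX10-era style: plain scalar math, computing only what's actually needed.
float AttenuationScalar(float dist, float invRadius)
{
	return saturate(1.0 - dist * invRadius);
}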

Also, keep in mind that 1080p has over 2 million pixels, making your pixel shaders the most intensely burdened tight loop in your entire code base, which means a small inefficiency can have a very large impact.

Edited by Hodgman, 09 October 2012 - 06:41 AM.


#7 Ashaman73   Crossbones+   -  Reputation: 7991


Posted 09 October 2012 - 06:50 AM

Actually, we wouldn't have been able to ship our last game running at 30Hz on the min spec if we hadn't hand-optimized all the HLSL code to do things that I originally assumed the compiler would be smart enough to do.

The interesting question is: did you leave this optimization in for all hardware, or did you use (pre-processor) branching to optimize your shader for certain video chip classes only? I still think that a general optimization is a bad idea, though I can understand that external pressure ('this title must run at 30 fps on my mother's PC') is a good reason to bend this rule.

#8 Hodgman   Moderators   -  Reputation: 31822


Posted 09 October 2012 - 07:17 AM

The interesting question is: did you leave this optimization in for all hardware, or did you use (pre-processor) branching to optimize your shader for certain video chip classes only... I can understand that external pressure ('this title must run at 30 fps on my mother's PC')

We hand-optimized only for the almost-min-spec, because the vast majority of sales are for it rather than, say, DX11 PCs (and those can run it regardless of perfect HLSL code). For the absolute-min-spec, we just disabled some features by default to get the frame rate up, and didn't care too much because it's only your mother and she won't notice.
Also, as you said, compilers are always getting better, so hopefully a modern PC's driver can pull apart our hand-vectorized shader code and put it back together into "modern" efficient code. On this topic -- the Unity guys actually compile their GLSL code (performing optimisations) and then output the results as regular text GLSL, so that on drivers with bad GLSL compilers the result is still optimized!
Sorry for going off-topic.

@OP - are you using literal pow values often enough to warrant this effort? I can only remember using constants of maybe 2/3/4/5, and I've just written your unrolled versions in-place for those cases. If a compiler is smart enough to realize that pow(x,2)==x*x, then it should also be smart enough to realise that (x*x)*(x*x)==pow(x,4) and pick the best anyway -- so if the hand optimisation is harmful to a new GPU with a smart compiler, it would be able to undo your cleverness.

Edited by Hodgman, 09 October 2012 - 07:20 AM.


#9 CryZe   Members   -  Reputation: 768


Posted 09 October 2012 - 07:41 AM

are you using literal pow values often enough to warrant this effort? I can only remember using constants of maybe 2/3/4/5, and I've just written your unrolled versions in-place for those cases. If a compiler is smart enough to realize that pow(x,2)==x*x, then it should also be smart enough to realise that (x*x)*(x*x)==pow(x,4) and pick the best anyway -- so if the hand optimisation is harmful to a new GPU with a smart compiler, it would be able to undo your cleverness.


I'm implementing this BRDF at the moment (Cook-Torrance with GGX distribution, Schlick Fresnel, Walter GGX Geometric Attenuation and modified energy conserving Lambert):
[BRDF equation image]
The parameters are the albedo, the refractive index of the first medium, the refractive index of the second medium, and the roughness of the material.
As you can see, there are quite a few literal pow values. But the implementation is WAY lighter than what you're seeing here. Due to Helmholtz reciprocity, about 50% of the BRDF (the view-dependent part) can be calculated once per pixel, and most of those calculations can be reused, so only about 25% actually needs to be calculated per light. And I believe it's worth it. Probably not, though.

But I guess you're right. New GPUs are probably capable of "deoptimizing" my code. I'll do a few benchmarks though.

Edited by CryZe, 09 October 2012 - 07:49 AM.


#10 MJP   Moderators   -  Reputation: 11761


Posted 09 October 2012 - 01:10 PM

When I've cared enough to check the assembly in the past, the HLSL compiler has replaced pow(x, 2) with x * x. I just tried a simple test case and it also worked:

Texture2D MyTexture;
float PSMain(in float4 Position : SV_Position) : SV_Target0
{
    return pow(MyTexture[Position.xy].x, 2.0f);
}

ps_5_0
dcl_globalFlags refactoringAllowed
dcl_resource_texture2d (float,float,float,float) t0
dcl_input_ps_siv linear noperspective v0.xy, position
dcl_output o0.x
dcl_temps 1
ftou r0.xy, v0.xyxx
mov r0.zw, l(0,0,0,0)
ld_indexable(texture2d)(float,float,float,float) r0.x, r0.xyzw, t0.xyzw
mul o0.x, r0.x, r0.x
ret
// Approximately 5 instruction slots used

I wouldn't be surprised if the HLSL compiler got tripped up every once in a while, but there's also the JIT compiler in the driver. So you'd have to check the actual microcode to know for sure, if you have access to that.

#11 Lightness1024   Members   -  Reputation: 739


Posted 09 October 2012 - 03:19 PM

Darn, I didn't know drivers were doing JIT. I always assumed it was only statically analyzed.

#12 CryZe   Members   -  Reputation: 768


Posted 10 October 2012 - 06:58 AM

Thanks for the answers. Good to know that FXC optimizes pow(x, 2). I thought I had checked that, but apparently I hadn't.

But I'm still pretty sure that it won't optimize it for other literals (I couldn't test it in the meantime, though). The thing is that pow(x, 2) isn't really what's interesting here; I just put it in the title to make clear what this topic is about. It's pretty obvious that a single MUL is always faster than, or at least as fast as, a POW. It gets more interesting for other literals though, especially when implementing Schlick's approximation of Fresnel.

One could implement it this way (wow, gamedev.net can't handle multiple lines of code in the code tag right now, which explains why so many users are posting misaligned code):
float rLDotH = 1 - LDotH;
float rLDotH2 = rLDotH * rLDotH;
float rLDotH5 = rLDotH2 * rLDotH2 * rLDotH;
float fresnel = reflectivity + (1 - reflectivity) * rLDotH5;
//Or even simpler:
//float fresnel = reflectivity + (1 - reflectivity) * constpow(1 - LDotH, 5);
//which has the same effect on the resulting assembly


Or this way:
float fresnel = reflectivity + (1 - reflectivity) * pow(1 - LDotH, 5);

Which one is the preferable implementation? Like I said, I haven't had time to do benchmarks yet, but I will. The thing is, my graphics card might have a pretty slow or a really fast POW compared to other graphics cards, so it might not be representative, and a single benchmark won't tell me which implementation is preferable in the average case.

Edited by CryZe, 10 October 2012 - 07:31 AM.


#13 Lightness1024   Members   -  Reputation: 739


Posted 10 October 2012 - 03:27 PM

Maybe you could try to guess from the assembly how many ALU vs SFU slots your shader uses, and try to balance that according to the ratio of the respective units in your average target card.

#14 chris77   Members   -  Reputation: 264


Posted 11 October 2012 - 09:17 AM

AMD has a tool called GPU ShaderAnalyzer that will take HLSL/GLSL and show you the actual machine instructions generated by the driver's compiler (you select which GPU you want to target). It can also estimate performance for you and analyze the bottlenecks. It's quite useful for answering these kinds of questions, because you can change the HLSL dynamically and watch how the generated code changes.

#15 Dave Eberly   Members   -  Reputation: 1161


Posted 15 October 2012 - 01:05 AM

AMD has a tool called GPU ShaderAnalyzer that will take HLSL/GLSL and show you the actual machine instructions generated by the driver's compiler (you select which GPU you want to target). It can also estimate performance for you and analyze the bottlenecks. It's quite useful for answering these kinds of questions, because you can change the HLSL dynamically and watch how the generated code changes.


The GUI version appears to limit you to Shader Model 3. Running from the command line, you can get to Shader Model 5 (in theory), but it crashes for me on my Windows 8 machine. I have not resorted to trying this on a Windows 7 machine. The performance counter libraries AMD provides allow you to instrument manually, and they appear to give similar information to what the GUI performance tool does. The only nit is that they leak DX objects (buffers and counters during sampling), so if you have any logic to verify that all DX reference counts go to zero on program termination, you have to disable those...

#16 MJP   Moderators   -  Reputation: 11761


Posted 15 October 2012 - 01:26 AM

The GUI version appears to limit you to Shader Model 3. Running from the command line, you can get to Shader Model 5 (in theory), but it crashes for me on my Windows 8 machine. I have not resorted to trying this on a Windows 7 machine. The performance counter libraries AMD provides allow you to instrument manually, and they appear to give similar information to what the GUI performance tool does. The only nit is that they leak DX objects (buffers and counters during sampling), so if you have any logic to verify that all DX reference counts go to zero on program termination, you have to disable those...


That's weird...the SM 4.0, 4.1, and 5.0 profiles are all available for me without doing anything on the command line, and they work just fine.



