Are GPU drivers optimizing pow(x,2)?

Christopher Serr · 2012-10-15T07:26:47

The power function is usually about 6 times slower than a simple mad instruction (at least on current NVidia GPUs). The HLSL compiler itself doesn't optimize it and simply converts the pow into a LOG, a MUL and an EXP instruction. But most constant powers up to x^32 would actually be faster to calculate using just MUL instructions.

Now the question is: should I bother optimizing it myself, or do the drivers usually optimize something like this? Here is the function I would use if I had to optimize it myself:

    float constpow(float x, uint y)
    {
        if (y == 0) return 1;               //Cost 0
        if (y == 1) return x;               //Cost 0
        float x2 = x * x;                   //Cost 1
        if (y == 2) return x2;              //Cost 1
        if (y == 3) return x2 * x;          //Cost 2
        float x4 = x2 * x2;                 //Cost 2
        if (y == 4) return x4;              //Cost 2
        if (y == 5) return x4 * x;          //Cost 3
        if (y == 6) return x4 * x2;         //Cost 3
        if (y == 7) return x4 * x2 * x;     //Cost 4
        float x8 = x4 * x4;                 //Cost 3
        if (y == 8) return x8;              //Cost 3
        if (y == 9) return x8 * x;          //Cost 4
        if (y == 10) return x8 * x2;        //Cost 4
        if (y == 11) return x8 * x2 * x;    //Cost 5
        if (y == 12) return x8 * x4;        //Cost 4
        if (y == 13) return x8 * x4 * x;    //Cost 5
        if (y == 14) return x8 * x4 * x2;   //Cost 5
        float x16 = x8 * x8;                //Cost 4
        if (y == 16) return x16;            //Cost 4
        if (y == 17) return x16 * x;        //Cost 5
        if (y == 18) return x16 * x2;       //Cost 5
        if (y == 20) return x16 * x4;       //Cost 5
        if (y == 24) return x16 * x8;       //Cost 5
        if (y == 32) return x16 * x16;      //Cost 5
        return pow(x, y);
    }

If the drivers did this themselves, it would probably be better to just leave the pow(x, y) in place, because they know better when to optimize it. I'd obviously only use this when y is constant, since I don't want any dynamic branching here.
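The cost comments in constpow follow a simple rule for this square-and-multiply scheme: floor(log2(y)) squarings plus one extra multiply for every additional set bit in y. Here is a minimal sketch of that rule, written as CPU-side C++ rather than HLSL so it can be checked on its own. The helper name `multiply_cost` is hypothetical and not part of the original post.

```cpp
#include <cassert>

// Number of multiplications needed for x^y when the squares x2, x4, x8, ...
// are reused the way constpow reuses them.
// Rule: floor(log2(y)) squarings, plus (number of set bits in y) - 1
// multiplies to combine the squares that are needed.
unsigned multiply_cost(unsigned y) {
    assert(y >= 1);  // x^0 is the constant 1 and needs no arithmetic
    unsigned squarings = 0;
    unsigned set_bits = 0;
    for (unsigned v = y; v != 0; v >>= 1) {
        set_bits += v & 1u;   // count the bits that contribute a factor
        if (v > 1) {
            ++squarings;      // one squaring per bit above the lowest
        }
    }
    return squarings + set_bits - 1;
}
```

For example, `multiply_cost(7)` is 4 and `multiply_cost(32)` is 5, matching the comments. `multiply_cost(15)` is 6, and 15 is one of the exponents that constpow leaves to pow.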


Started by CryZe October 09, 2012 11:23 AM

14 comments, last by MJP 11 years, 6 months ago

Lightness1024

939

October 09, 2012 09:19 PM

Darn, I didn't know drivers were doing JIT compilation. I always assumed shaders were only statically analyzed.

CryZe

773

Author

October 10, 2012 12:58 PM

Thanks for the answers. Good to know that FXC optimizes pow(x, 2). I thought I had checked that, but it looks like I didn't.

But I'm still pretty sure that it won't optimize it for other literals (I couldn't test it in the meantime, though). The thing is that pow(x, 2) isn't really the interesting case. I just put it in the title to make clear what this topic is about. It's pretty obvious that a single MUL is always faster than, or at least as fast as, a POW. It gets more interesting for other literals, though, especially when implementing Schlick's approximation of Fresnel.

One could implement it this way (wow, gamedev.net can't handle multiple lines of code in the code tag right now, which explains why so many users are posting such misaligned code):

    float rLDotH = 1 - LDotH;
    float rLDotH2 = rLDotH * rLDotH;
    float rLDotH5 = rLDotH2 * rLDotH2 * rLDotH;
    float fresnel = reflectivity + (1 - reflectivity) * rLDotH5;
    //Or even simpler:
    //float fresnel = reflectivity + (1 - reflectivity) * constpow(1 - LDotH, 5);
    //which has the same effect on the resulting assembly

Or this way:

    float fresnel = reflectivity + (1 - reflectivity) * pow(1 - LDotH, 5);

Which one is the preferable implementation? Like I said, I haven't had time to run benchmarks yet, but I will. The catch is that my graphics card might have a particularly slow or particularly fast POW instruction compared to other cards, so a single benchmark wouldn't be representative and won't tell me which implementation is preferable in the average case.
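To check that the two formulations agree numerically, here is a minimal CPU-side C++ sketch of both. It says nothing about GPU speed. The pow path is modeled as the LOG, MUL, EXP sequence mentioned in the first post, under the assumption that both instructions are base 2. The function names `fresnel_mul` and `fresnel_log_exp` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Schlick's Fresnel term with the fifth power built from three multiplies.
float fresnel_mul(float reflectivity, float LDotH) {
    float r = 1.0f - LDotH;
    float r2 = r * r;
    float r5 = r2 * r2 * r;
    return reflectivity + (1.0f - reflectivity) * r5;
}

// The same term with the power modeled as LOG, MUL, EXP.
// Assumption: base-2 log and exp, used as a CPU stand-in for the GPU path.
float fresnel_log_exp(float reflectivity, float LDotH) {
    float r = 1.0f - LDotH;
    assert(r >= 0.0f);  // log2 of a negative value is NaN
    float r5 = std::exp2(5.0f * std::log2(r));
    return reflectivity + (1.0f - reflectivity) * r5;
}
```

One difference worth keeping in mind: the multiply version still returns a finite value if LDotH ends up slightly above 1, while the log-based model produces NaN for a negative base, so the two are only interchangeable when 1 - LDotH stays non-negative.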

Lightness1024

939

October 10, 2012 09:27 PM

Maybe you could work out from the assembly how many ALU vs. SFU slots your shader uses, and then balance that according to the ratio of the respective units in your average target card.
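A very rough way to reason about that balance is sketched below in C++. It assumes a simplistic throughput model in which ALU and SFU work overlap perfectly, so the slower of the two pipes determines the time. The types, the function names, and all unit and instruction counts are hypothetical placeholders, not measurements from any real card.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-multiprocessor unit counts for a target GPU.
struct UnitCounts {
    float alu_units;
    float sfu_units;
};

// Instruction counts read off the shader assembly.
struct ShaderOps {
    float alu_ops;
    float sfu_ops;
};

// Assumed model: both pipes run in parallel, so the busier one dominates.
float estimated_cycles(ShaderOps ops, UnitCounts gpu) {
    assert(gpu.alu_units > 0.0f && gpu.sfu_units > 0.0f);
    float alu_cycles = ops.alu_ops / gpu.alu_units;
    float sfu_cycles = ops.sfu_ops / gpu.sfu_units;
    return std::max(alu_cycles, sfu_cycles);
}

// True when the special function units are the limiting pipe.
bool is_sfu_bound(ShaderOps ops, UnitCounts gpu) {
    return ops.sfu_ops / gpu.sfu_units > ops.alu_ops / gpu.alu_units;
}
```

With a made-up card of 32 ALUs and 4 SFUs, a shader with 3 ALU ops and 2 SFU ops comes out SFU-bound under this model, while a variant with 5 ALU ops and no SFU ops does not. Real hardware has scheduling and latency effects that this ignores.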

chrisATI

266

October 11, 2012 03:17 PM

AMD has a tool called GPU Shader Analyzer that will take HLSL/GLSL and show you the actual machine instructions generated by the driver's compiler (you select which GPU you want to target). It can also estimate performance for you and analyze the bottlenecks. It's quite useful for answering these kinds of questions because you can change the HLSL dynamically and watch how the generated code changes.

Dave Eberly

1,182

October 15, 2012 07:05 AM

The GUI version appears to limit you to Shader Model 3. Running from a command line, you can get to Shader Model 5 (in theory), but it crashes for me on my Windows 8 machine. I have not resorted to trying this on a Windows 7 machine. The performance counter libraries AMD provides allow you to instrument manually, and they appear to give information similar to what the GUI performance tool gives. The only nit is that they leak DX objects (buffers and counters during sampling), so if you have any logic to verify that all DX reference counts go to zero on program termination, you have to disable it...

MJP

20,295

October 15, 2012 07:26 AM

That's weird...the SM 4.0, 4.1, and 5.0 profiles are all available for me without doing anything on the command line, and they work just fine.
