CryZe

Are GPU drivers optimizing pow(x,2)?

15 posts in this topic

The power function is usually about six times slower than a simple MAD instruction (at least on current NVIDIA GPUs). The HLSL compiler itself doesn't optimize it and simply converts the pow into a LOG, a MUL and an EXP instruction. But most constant powers up to x^32 would actually be faster to calculate with just MUL instructions. Now the question is: should I bother optimizing it myself, or do the drivers usually optimize something like this? Here is the function I would use if I had to optimize it myself:

[CODE]float constpow(float x, uint y)
{
if (y == 0)
return 1; //Cost 0

if (y == 1)
return x; //Cost 0

float x2 = x * x; //Cost 1

if (y == 2)
return x2; //Cost 1

if (y == 3)
return x2 * x; //Cost 2

float x4 = x2 * x2; //Cost 2

if (y == 4)
return x4; //Cost 2

if (y == 5)
return x4 * x; //Cost 3

if (y == 6)
return x4 * x2; //Cost 3

if (y == 7)
return x4 * x2 * x; //Cost 4

float x8 = x4 * x4; //Cost 3

if (y == 8)
return x8; //Cost 3

if (y == 9)
return x8 * x; //Cost 4

if (y == 10)
return x8 * x2; //Cost 4

if (y == 11)
return x8 * x2 * x; //Cost 5

if (y == 12)
return x8 * x4; //Cost 4

if (y == 13)
return x8 * x4 * x; //Cost 5

if (y == 14)
return x8 * x4 * x2; //Cost 5

float x16 = x8 * x8; //Cost 4

if (y == 16)
return x16; //Cost 4

if (y == 17)
return x16 * x; //Cost 5

if (y == 18)
return x16 * x2; //Cost 5

if (y == 20)
return x16 * x4; //Cost 5

if (y == 24)
return x16 * x8; //Cost 5

if (y == 32)
return x16 * x16; //Cost 5

return pow(x, y);
}
[/CODE]

If the drivers do this themselves, it would probably be better to just leave the pow(x, y) there, because they know better when to optimize it. I'd obviously only use this when y is a constant, and I obviously don't want any dynamic branching here. Edited by CryZe
I guess the best thing to do would be to run some tests.

Why don't you render some full-screen quads using each method and use something like PIX/PerfHUD to see how long each one ends up taking? Shouldn't take too long.

(Only saying this as I don't know the answer to your question.) I wouldn't have thought they would optimize in that way, and in a number of cases you will know the power, so you can write it out with MULs on a case-by-case basis.
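If it helps, here's a rough sketch of what such a test could look like (the shader names and the repeat count are just placeholders; the loop is only there so the ALU cost is large enough to stand out over the pass overhead in PIX/PerfHUD):

[CODE]// Hypothetical test shaders: draw a full-screen quad with each one and compare GPU times.
float4 PS_Pow(float2 uv : TEXCOORD0) : SV_Target
{
    float r = 0;
    [unroll]
    for (int i = 0; i < 64; ++i)
        r += pow(uv.x + i * 0.001f, 16);
    return float4(r, r, r, 1);
}

float4 PS_Mul(float2 uv : TEXCOORD0) : SV_Target
{
    float r = 0;
    [unroll]
    for (int i = 0; i < 64; ++i)
    {
        float b  = uv.x + i * 0.001f;
        float b2 = b * b;
        float b4 = b2 * b2;
        float b8 = b4 * b4;
        r += b8 * b8; // x^16 via four MULs
    }
    return float4(r, r, r, 1);
}
[/CODE]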
A branching version is quite likely to run even slower than just calling pow directly (especially with GPU code, where not all pixels in a group may take the same path), so I wouldn't even contemplate doing it that way. If you know that your exponent will always be in a certain range, you may be able to encode it into a 1D texture, but I haven't benchmarked that against just using pow, so I can't say anything about the performance comparison.
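In case it's not clear what I mean, here's a minimal sketch of the lookup-texture idea (SM4-style syntax; the names are made up, and powLUT would be a 1D texture you fill on the CPU with x^n sampled over [0, 1]):

[CODE]// Hypothetical lookup: powLUT stores pow(t, n) for t in [0, 1] at, say, 256 texels.
// Only usable when x stays in [0, 1], e.g. a saturated dot product.
Texture1D<float> powLUT      : register(t0);
SamplerState     linearClamp : register(s0);

float PowFromLUT(float x)
{
    return powLUT.SampleLevel(linearClamp, x, 0);
}
[/CODE]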
[quote name='CryZe' timestamp='1349781792' post='4988293']
Now the question is: Should I bother optimizing it myself, or do the drivers usually optimize something like this?
[/quote]
This is the old 'if I write this cool, fast asm by hand, I can beat the compiler' discussion. The general consensus is that the compiler will most likely be better than your handwritten optimization, maybe not today, but tomorrow, especially if you consider different platforms and future driver and hardware updates.
If you compile with fxc you can output the intermediate DirectX assembly as a text file and just check what pow(x,2) was transformed to.
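Something along these lines (the file and entry-point names here are placeholders; /Fc writes the generated assembly to a listing file you can read):

[CODE]rem Compile the same shader for two profiles and dump the assembly listings.
fxc /T ps_3_0 /E PSMain /Fc listing_ps3.txt shader.hlsl
fxc /T ps_5_0 /E PSMain /Fc listing_ps5.txt shader.hlsl
rem Then search the listings for what pow(x, 2) became (log/mul/exp vs. a plain mul).
[/CODE]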
Be careful, it strongly depends on the shader profile you target. For example, sin(x) is expanded into four MADs in shader model 1, but it uses sincos(x) from shader model 2 onwards since it is an intrinsic there. There are similar differences with x*2, which is often implemented as x+x, but it varies according to the profile.
Also, always keep in mind there is a second compilation stage done by the driver.
Most likely the pow(x,2) will already be mutated into some cheaper ALU instructions by the fxc compiler itself.
Like previously mentioned, test performance in practice to be sure, but I have a feeling it will be difficult to see any difference, because the noise might be greater than the difference in that case.
Don't forget to post your results here ^^
Cheers
[quote name='Hodgman' timestamp='1349786374' post='4988319']
Actually, we wouldn't have been able to ship our last game running at 30Hz on the min specs if we didn't hand optimize all the HLSL code to do things that I originally assumed the compiler would be smart enough to do.
[/quote]
The interesting question is: did you leave this optimization in for all hardware, or did you use (pre-processor) branching to optimize your shader for certain video chip classes only? I still think that a general optimization is a bad idea, though I can understand that external pressure ("this title must run at 30 fps on my mother's PC") is a good reason to bend this rule :)
[quote name='Ashaman73' timestamp='1349787052' post='4988324']The interesting question is, did you leave this optimization for all hardware, or did you use (pre-processor-)branching to optimize your shader for certain videochip classes only... I can understand that external pressure (this title must run at 30 fps on my mother PCs)[/quote]We hand-optimized only for the [i]almost-min-spec[/i], because the vast majority of sales are for it rather than, say, DX11 PCs ([i]and those can run it regardless of perfect HLSL code[/i] :P). For the [i]absolute-min-spec[/i], we just disabled some features by default to get the frame rate up, and didn't care too much because it's only your mother and she won't notice.
Also, as you said, compilers are always getting better, so hopefully a modern PC's driver can pull apart our hand-vectorized shader code and put it back together into "modern" efficient code. On this topic -- the Unity guys actually compile their GLSL code (performing optimisations), and then output the results as regular text GLSL code -- so that on drivers with bad GLSL compilers, the result is still optimized!
Sorry for going off-topic.

@OP - are you using literal pow values often enough to warrant this effort? ;) I can only remember using constants of maybe 2/3/4/5, and I've just written your unrolled versions in-place for those cases. If a compiler is smart enough to realize that pow(x,2)==x*x, then it should also be smart enough to realise that (x*x)*(x*x)==pow(x,4) and pick the best anyway -- so if the hand optimisation is harmful on a new GPU with a smart compiler, it should be able to undo your cleverness. Edited by Hodgman
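(To be explicit, by "written in-place" I just mean something like this, with a made-up name:)

[CODE]// Hypothetical example: pow(x, 4) with the exponent written out by hand at the use site.
float SpecTermUnrolled(float nDotH)
{
    float n2 = nDotH * nDotH;   // pow(nDotH, 2)
    return n2 * n2;             // pow(nDotH, 4)
}
[/CODE]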
[quote name='Hodgman' timestamp='1349788644' post='4988329']
are you using literal pow values often enough to warrant this effort? I can only remember using constants of maybe 2/3/4/5, and I've just written your unrolled versions in-place for those cases. If a compiler is smart enough to realize that pow(x,2)==x*x, then it should also be smart enough to realise that (x*x)*(x*x)==pow(x,4) and pick the best anyway -- so if the hand optimisation is harmful to a new GPU with a smart compiler, it would be able to undo your cleverness.
[/quote]

I'm implementing this BRDF at the moment (Cook-Torrance with GGX distribution, Schlick Fresnel, Walter GGX Geometric Attenuation and modified energy conserving Lambert):
[img]http://unlimitedengine.us.to/brdf.png[/img]
[eqn]\rho(\mathrm{x},\lambda)[/eqn] is the albedo, [eqn]n_1(\mathrm{x},\lambda)[/eqn] is the refractive index of the first medium, [eqn]n_2(\mathrm{x},\lambda)[/eqn] is the refractive index of the second medium, and [eqn]\alpha(\mathrm{x})[/eqn] is the roughness of the material.
As you can see, there are quite a few literal pow values. But the implementation is WAY lighter than what you're seeing here. Due to Helmholtz reciprocity, about 50% of the BRDF (the view-dependent part) can be calculated once per pixel, and most of the remaining calculations can reuse results from that view-dependent part, so only about 25% actually needs to be calculated per light. And I believe it's worth it. Probably not, though :D
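To give an idea of where those literal powers come from, here's a rough sketch of two of the terms (not the actual shader from the image above; constpow is the helper from my first post):

[CODE]// Hedged sketch: GGX normal distribution, D(h) = a^2 / (pi * ((n.h)^2 * (a^2 - 1) + 1)^2)
float D_GGX(float NDotH, float alpha)
{
    float a2 = alpha * alpha;
    float d  = NDotH * NDotH * (a2 - 1.0f) + 1.0f;
    return a2 / (3.14159265f * d * d); // the squared denominator is a literal pow(d, 2)
}

// Schlick's Fresnel approximation, F = F0 + (1 - F0) * (1 - (l.h))^5
float F_Schlick(float LDotH, float reflectivity)
{
    return reflectivity + (1.0f - reflectivity) * constpow(1.0f - LDotH, 5);
}
[/CODE]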

But I guess you're right. New GPUs are probably capable of "deoptimizing" my code. I'll do a few benchmarks though. Edited by CryZe
Darn, I didn't know drivers were doing JIT; I always assumed it was only statically analyzed.
Thanks for the answers. Good to know that FXC optimizes pow(x, 2). I thought I had checked that, but apparently I didn't :D

But I'm still pretty sure that it won't optimize it for other literals (I couldn't test it in the meantime, though). The thing is, pow(x, 2) isn't really the interesting case; I just put it in the title to make clear what this topic is about. It's pretty obvious that a single MUL is always faster than, or at least as fast as, a POW. It gets more interesting for other literals, though, especially when implementing Schlick's approximation of the Fresnel term.

One could implement it this way (by the way, gamedev.net can't handle multiple lines of code in the code tag right now, which explains why so many users are posting misaligned code at the moment):
[CODE]float rLDotH = 1 - LDotH;
float rLDotH2 = rLDotH * rLDotH;
float rLDotH5 = rLDotH2 * rLDotH2 * rLDotH;
float fresnel = reflectivity + (1 - reflectivity) * rLDotH5;
//Or even simpler:
//float fresnel = reflectivity + (1 - reflectivity) * constpow(1 - LDotH, 5);
//which has the same effect on the resulting assembly
[/CODE]

Or this way:
[CODE]float fresnel = reflectivity + (1 - reflectivity) * pow(1 - LDotH, 5);[/CODE]

Which one is the preferable implementation? Like I said, I haven't had time to run benchmarks yet, but I will. The thing is, my graphics card might have a particularly slow or fast POW instruction compared to other graphics cards, so it might not be representative, and a single benchmark won't tell me which implementation is preferable in the average case. Edited by CryZe
Maybe you could try to estimate from the assembly how many ALU vs. SFU slots your shader uses and try to balance that according to the ratio of the respective units in your average target card.
AMD has a tool called [url="http://developer.amd.com/tools/gpu/shader/Pages/default.aspx"]GPU Shader Analyzer[/url] that will take HLSL/GLSL and show you the actual machine instructions generated by the driver's compiler (you select which GPU you want to target). It can also estimate performance for you and analyze the bottlenecks. It's quite useful for answering these kinds of questions, because you can change the HLSL dynamically and watch how the generated code changes.
[quote name='chris77' timestamp='1349968638' post='4989124']
AMD has a tool called [url="http://developer.amd.com/tools/gpu/shader/Pages/default.aspx"]GPU Shader Analyzer[/url] that will take HLSL/GLSL and show you the actual machine instructions generated by the driver's compiler (you select which GPU you want to target). It can also estimated performance for you and analyze the bottlenecks. It's quite useful for answering these kinds of questions because you can change the HLSL dynamically and watch how the generated code changes.
[/quote]

The GUI version appears to limit you to Shader Model 3. Running from the command line, you can get to Shader Model 5 (in theory), but it crashes for me on my Windows 8 machine; I have not yet resorted to trying this on a Windows 7 machine. The performance counter libraries AMD provides allow you to instrument manually, and they appear to give similar information to what the GUI performance tool does. The only nit is that they leak DX objects (buffers and counters during sampling), so if you have any logic that verifies all DX reference counts go to zero on program termination, you have to disable it...
[quote name='Dave Eberly' timestamp='1350284706' post='4990283']
The GUI version appears to limit you to Shader Model 3. Running from a command line, you can get to Shader Model 5 (in theory), but it crashes for me on my Windows 8 machine. I have not resorted to trying this on a Windows 7 machine. The performance counter libraries AMD provides allows you to instrument manually, and they appear to give similar information that the GUI performance tool does. The only nit is that they leak DX objects (buffers and counters during sampling), so if you have any logic to verify that all DX reference counts go to zero on program termination, you have to disable those...
[/quote]

That's weird... the SM 4.0, 4.1, and 5.0 profiles are all available for me without doing anything on the command line, and they work just fine.
