# A Fast pow(a,b) and SSE



Hi,

So, my ray-tracer takes a very large amount of time (28%!) doing pow calculations. Despite this, I have been unable to find anything faster. I have tested:
http://martin.ankerl.com/2007/10/04/optimized-pow-approximation-for-java-and-c-c/
http://www.dctsystems.co.uk/Software/power.html

And none of them give any speedup, even for the quality hits they incur. I'm using Visual Studio with full optimizations. I am now attempting to use the SSE pow functions here. However, I am running into a problem:

Error: variable "__m128" is not a type name

. . . and, when attempting to compile:

error C2146: syntax error : missing ';' before identifier '__m128'

I've looked all over for how to fix this, but I can't solve it. I have tried:
[source lang="cpp"]#include <xmmintrin.h>[/source]
. . . xor:
[source lang="cpp"]#include <emmintrin.h>[/source]
. . . but to no avail.

CPU is Intel Core 2 Duo T8300 (2.4GHz, dual core)

Thanks,
G

##### Share on other sites
What part of the computation makes use of pow? Is it the evaluation of the Phong reflection model? Perhaps there are ways to avoid calling pow so many times to begin with.

##### Share on other sites
Which compiler are you using? What version?

##### Share on other sites
The pow function is being used for calculating specular and for calculating hemispherical random samples weighted by cos^n(theta).

The compiler is Visual Studio 2010 Ultimate's compiler.
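For readers wondering where pow shows up in the sampling case, here is a minimal sketch of one common way to draw cos^n(theta)-weighted directions by inverting the CDF. It is not necessarily the poster's code: `Vec3`, `sampleCosinePower`, and the choice of +Z as the lobe axis are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal vector type, for illustration only.
struct Vec3 { double x, y, z; };

// Hypothetical helper: maps two uniform random numbers u1, u2 in [0, 1]
// to a unit direction around +Z with density proportional to cos^n(theta).
Vec3 sampleCosinePower(double u1, double u2, double n) {
    assert(n >= 0.0);
    const double kTwoPi = 6.283185307179586;
    // Inverting the CDF gives cos(theta) = u1^(1 / (n + 1)).
    // This is the pow call that runs once per sample.
    double cosTheta = std::pow(u1, 1.0 / (n + 1.0));
    double sinTheta = std::sqrt(std::fmax(0.0, 1.0 - cosTheta * cosTheta));
    double phi = kTwoPi * u2;
    return { sinTheta * std::cos(phi), sinTheta * std::sin(phi), cosTheta };
}
```

Because the exponent 1/(n+1) is fixed per material, this pow only varies in its first argument, which is what makes precomputing or tabulating it attractive.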

##### Share on other sites
Post all your source for the file which you can't compile...

##### Share on other sites
[quote]Post all your source for the file which you can't compile...[/quote]
It's that code from the article I linked. I added the #includes which I thought would make it work, but it still fails early:
[source lang="cpp"]#pragma once
#include <xmmintrin.h> //I've alternately tried: "#include <emmintrin.h>"

#define EXP_POLY_DEGREE 3

#define POLY0(x, c0) _mm_set1_ps(c0)
#define POLY1(x, c0, c1) _mm_add_ps(_mm_mul_ps(POLY0(x, c1), x), _mm_set1_ps(c0))
#define POLY2(x, c0, c1, c2) _mm_add_ps(_mm_mul_ps(POLY1(x, c1, c2), x), _mm_set1_ps(c0))
#define POLY3(x, c0, c1, c2, c3) _mm_add_ps(_mm_mul_ps(POLY2(x, c1, c2, c3), x), _mm_set1_ps(c0))
#define POLY4(x, c0, c1, c2, c3, c4) _mm_add_ps(_mm_mul_ps(POLY3(x, c1, c2, c3, c4), x), _mm_set1_ps(c0))
#define POLY5(x, c0, c1, c2, c3, c4, c5) _mm_add_ps(_mm_mul_ps(POLY4(x, c1, c2, c3, c4, c5), x), _mm_set1_ps(c0))

__m128 exp2f4(__m128 x) // <= FAILS HERE!!!
{
__m128i ipart;
__m128 fpart, expipart, expfpart;

//...[/source]
-G

##### Share on other sites
Did you try #include <Windows.h> ?

Often when something fails for that reason, and in such a simple single header case, it's because the header depends on a define/typedef from the windows.h header...

It could also be a case of fail for a header included before the one you pasted... (I'm assuming it's a header due to the #pragma once in it)

##### Share on other sites
[quote]Did you try #include <Windows.h> ?

Often when something fails for that reason, and in such a simple single header case, it's because the header depends on a define/typedef from the windows.h header...[/quote]
I attempted:
[source lang="cpp"]#include <Windows.h>
#include <xmmintrin.h>
#include <emmintrin.h>[/source]
. . . but it didn't work.
[quote]It could also be a case of fail for a header included before the one you pasted... (I'm assuming it's a header due to the #pragma once in it)[/quote]
Yes, it is a header, and nope, everything else works fine.
Thanks,
-G

##### Share on other sites
Not sure how this works in VS but for GCC you usually also have to turn on SSE instructions in the compiler to be able to use those headers.

##### Share on other sites
I just tried pasting that code into a new console app using VS2010 Express, and using both includes it compiled just fine.

Commenting out all the includes gives "error C2146: syntax error : missing ';' before identifier 'exp2f4'". Note the differing location of the error message.

I suspect your problem is a missing ; at the end of a class definition in another file.
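One way to separate a header problem from a problem elsewhere in the project is to compile a tiny standalone file that uses `__m128`. The sketch below is only an illustration of that idea: `linear4` and `sseSmokeTest` are made-up names, and it assumes an x86 target where SSE is available.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: __m128 and the _mm_*_ps intrinsics

// Hypothetical helper: evaluates c0 + c1 * x on four floats at once,
// in the same style as the POLY1 macro from the article.
__m128 linear4(__m128 x, float c0, float c1) {
    return _mm_add_ps(_mm_mul_ps(_mm_set1_ps(c1), x), _mm_set1_ps(c0));
}

// Hypothetical smoke test: checks all four lanes of 1 + 2 * x.
void sseSmokeTest() {
    // _mm_set_ps takes the highest lane first, so lanes 0..3 hold 0, 1, 2, 3.
    __m128 x = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);
    float out[4];
    _mm_storeu_ps(out, linear4(x, 1.0f, 2.0f));
    assert(out[0] == 1.0f && out[1] == 3.0f && out[2] == 5.0f && out[3] == 7.0f);
}
```

If a file like this compiles on its own but the real header does not, the fault is more likely in the surrounding code than in the SSE headers.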

##### Share on other sites
Hi,

I looked all over the code again, and it appears that, asininely, I had an extra line of syntactically incorrect code farther down in the file. Removing it solved the problem. It would have been helpful if the error was there instead. Oh well.

Thanks!
-G

##### Share on other sites
Is your custom pow(a, b) function faster than that provided by std? Because I spent a while thinking the one included with C# is slow - Turns out? Very fast.

##### Share on other sites
[quote]Is your custom pow(a, b) function faster than that provided by std? Because I spent a while thinking the one included with C# is slow - Turns out? Very fast.[/quote]
Nope, all my attempts were slower. Which is really lame. That's why I was hoping for SSE optimizations to actually get something faster. I've since moved on to other things (for the time being, I had simply precomputed the random samples and the specular exponents, which works almost as well).

Thanks,

##### Share on other sites

[quote name='Narf the Mouse' timestamp='1316919846' post='4865631']Is your custom pow(a, b) function faster than that provided by std? Because I spent a while thinking the one included with C# is slow - Turns out? Very fast.
Nope, all my attempts were slower. Which is really lame. That's why I was hoping for SSE optimizations to actually get something faster. I've since moved on to other things (for the time being, I had simply precomputed the random samples and the specular exponents, which works almost as well).

Thanks,
[/quote]
Not really; the default one could be executing as little as a single assembly instruction.

Also, specular sounds like something that should be computed on the graphics card.

##### Share on other sites
[quote]Not really; the default one could be executing as little as a single assembly instruction.[/quote]
Even if that were so, it wouldn't execute in one clock cycle. "pow" typically takes more than 150 cycles. Obviously, if you vectorize the pow, then you can compute multiple pow calls at once, which is going to be more efficient.
[quote]Also, specular sounds like something that should be computed on the graphics card.[/quote]
Yes, my GPU ray tracer does that. My CPU one (the one I'm working on) works entirely on the CPU. In any case, doing only vectorized pow calculations on the graphics card (which has slower clock speeds but more processors) doesn't make sense--you waste more time waiting on the graphics bus than you save in compute time.

##### Share on other sites

[quote]Nope, all my attempts were slower. Which is really lame. That's why I was hoping for SSE optimizations to actually get something faster. I've since moved on to other things (for the time being, I had simply precomputed the random samples and the specular exponents, which works almost as well).

Thanks,[/quote]

pow() itself is an extremely slow function in general.
You won’t get anything faster than the pow() used on the GPU and even with that I got noticeably slower frame rates with higher specular values (higher values passed to pow()).
If I recall correctly, I started with something like 2,000 FPS with a pow( X, 2 ). It dropped to something like 1,700 with a pow( X, 127 ).
That is a difference of about 0.09 milliseconds per frame for just a single object, and this was on two ATI Radeon HD 5870s in CrossFire! You can bet those higher specular values are going to do damage to your CPU implementation.
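As a quick sanity check on those figures, taking the frame rates as the roughly 2,000 and 1,700 FPS recalled above, the per-frame times work out to:

$$
t_{2} = \frac{1}{2000}\,\text{s} = 0.500\,\text{ms}, \qquad
t_{127} = \frac{1}{1700}\,\text{s} \approx 0.588\,\text{ms}, \qquad
t_{127} - t_{2} \approx 0.088\,\text{ms}
$$

So the extra cost is roughly 0.09 ms per frame.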

Most likely your fastest average result will come from creating a look-up table, the same way cos() and sin() are often handled in games.
A table from 0x0000 to 0xFFFF should give enough accuracy for powers ranging from 2 to 127 (2-127 being the typical range of valid specular components).
A look-up table will probably be slower than pow() for small specular values, but faster for high ones.
It should certainly be faster than those suggested by the articles you posted.

Since the first parameter will always be between 0 and 1, it is reasonable to make a matrix look-up table for this.

L. Spiro

##### Share on other sites
[quote]Most likely your fastest average result will come from creating a look-up table, the same way cos() and sin() are often handled in games.
A table from 0x0000 to 0xFFFF should give enough accuracy for powers ranging from 0 to 127.
A look-up table will probably be slower than small specular values, but faster than high ones.
It should certainly be faster than those suggested by the articles you posted.[/quote]
Yeah. I gave each material a pow table. Because each material's exponent is fixed, the table only need vary the "a" in pow(a,b) (the "a" ranges over [0.0,1.0]). This means that the table can be much smaller and more accurate. In my specular calculation, I have it average the two nearest values in the table. The render as a whole is 22% faster, and I can't tell the difference in quality.

Thanks,
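For anyone who wants to try the per-material table idea, here is a minimal sketch of how it could look. This is not the actual code from the ray tracer: the `PowTable` class, its table size parameter, and the choice of linear interpolation between the two nearest entries (rather than a plain average) are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-material table for pow(a, exponent), with a in [0, 1].
class PowTable {
public:
    PowTable(double exponent, std::size_t entries) : values_(entries) {
        assert(entries >= 2);
        for (std::size_t i = 0; i < entries; ++i) {
            double a = static_cast<double>(i) / static_cast<double>(entries - 1);
            values_[i] = std::pow(a, exponent);  // paid once, at material setup
        }
    }

    // Approximates pow(a, exponent) by blending the two nearest entries.
    // Inputs outside [0, 1] are clamped to the ends of the table.
    double operator()(double a) const {
        if (a <= 0.0) return values_.front();
        if (a >= 1.0) return values_.back();
        double pos = a * static_cast<double>(values_.size() - 1);
        std::size_t i = static_cast<std::size_t>(pos);
        if (i >= values_.size() - 1) return values_.back();  // rounding guard
        double t = pos - static_cast<double>(i);              // blend weight in [0, 1)
        return values_[i] + t * (values_[i + 1] - values_[i]);
    }

private:
    std::vector<double> values_;
};
```

Usage would be something like `PowTable spec(64.0, 4096);` once per material, then `spec(cosAngle)` in the shading loop. Whether this beats the library pow depends on table size and cache behavior, so it is worth profiling.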

##### Share on other sites
[quote]I started with something like 2,000 FPS with a pow( X, 2 ). It dropped to something like 1,700 with a pow( X, 127 ).[/quote]

That's because the compiler is smart enough to replace pow(X, 2) with X * X.

##### Share on other sites
I don’t know that my lower limit was exactly 2.
When I first noticed that low powers were faster than higher powers, I started plugging in a few more numbers starting low and going up. Not whole numbers (I was using a slider, so every value I tested had fractions).
It consistently decreased the frame rate as I went higher and higher.

Actually it is not possible for the compiler to optimize this anyway, because 2 and 127 (and every value between) were not hard-coded numbers. They were shader uniforms.

L. Spiro

##### Share on other sites
If b is an integer, you can try using addition chains. At one former employer, it gave a big speed improvement in time value of money calculations. Though I guess there's a good chance that's what the VS compiler does under the hood.
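To illustrate the integer case, here is a minimal sketch using binary exponentiation (square-and-multiply), which is one simple kind of addition chain, though not always the shortest one. `powInt` is a made-up name, and this is not a claim about what the VS compiler or library actually does.

```cpp
#include <cassert>

// Hypothetical helper: computes base^exp for a non-negative integer exponent
// using O(log exp) multiplications instead of exp - 1.
double powInt(double base, int exp) {
    assert(exp >= 0);
    double result = 1.0;
    while (exp != 0) {
        if (exp & 1) result *= base;  // this bit of the exponent is set
        base *= base;                 // base^(2^k) for the next bit
        exp >>= 1;
    }
    return result;
}
```

For an exponent of 127 this needs about a dozen multiplications, but it only applies when the exponent is a whole number.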

##### Share on other sites
Unfortunately, the exponent isn't necessarily integral.