
Extremely fast sin approximation


This is an extremely fast sin approximation I have been working on for a while, and I thought I'd share it. Before getting to the code and how I derived the approximation, let's start with some data:

fast_sin time: 148.4ms
sinf time: 572.7ms
sin time: 1231.2ms
Worst error: 0.000296
Average error: 0.000124
Average relative error: 0.02%


As you can see, this approximation is around 3.9 times as fast as sinf and 8.3 times as fast as the default sin. This was tested with all optimizations turned on. The worst error is 0.000296, which makes the approximation very usable. On top of that, at the points 0, pi/6, pi/2 (and multiples of those) fast_sin gives exactly correct results.

Now, this isn't all magic and ponies; there are of course downsides. The biggest speed-up comes from a fast truncate (more on this later). This trick relies on IEEE754 and on the FPU being in double-precision mode. If either of those isn't available the trick cannot be used and we have to fall back to a slow truncate (cast-to-int), which slows the routine down by about a factor of 2. It's still a lot faster, but not as fast as it could be.

What implications does this have? For most of us: none. Here on gamedev we mostly target Windows and sometimes Mac & Linux, and all of those are little-endian. And if you don't mess with your FPU state it is most likely in double-precision mode.

Alright, now let's see the code:

/* uncomment the next line if you're on a big-endian system */
/* #define BIG_ENDIAN */

/* uncomment the next line if you can not assume double-precision FPU or IEEE754 */
/* #define NO_FAST_TRUNCATE */

/* we need to do some custom hacking for MSVC */
#ifdef _MSC_VER
typedef __int32 int32_t;
#else
#include <stdint.h>
#endif

inline int32_t fast_round(double x) {
#ifndef NO_FAST_TRUNCATE

    const double MAGIC_ROUND = 6755399441055744.0; /* http://stereopsis.com/sree/fpu2006.html */

    union {
        double d;

        struct {
#ifdef BIG_ENDIAN
            int32_t hw;
            int32_t lw;
#else
            int32_t lw;
            int32_t hw;
#endif
        };
    } fast_trunc;

    fast_trunc.d = x;
    fast_trunc.d += MAGIC_ROUND;

    return fast_trunc.lw;

#else

    if (x < 0) {
        return (int32_t) (x - 0.5);
    } else {
        return (int32_t) (x + 0.5);
    }

#endif
}

inline double fast_sin(double x) {
    const double PI    = 3.14159265358979323846264338327950288;
    const double INVPI = 0.31830988618379067153776752674502872;
    const double A     = 0.00735246819687011731341356165096815;
    const double B     = -0.16528911397014738207016302002888890;
    const double C     = 0.99969198629596757779830113868360584;

    int32_t k;
    double x2;

    /* find offset of x from the range -pi/2 to pi/2 */
    k = fast_round(INVPI * x);

    /* bring x into range */
    x -= k * PI;

    /* calculate sine */
    x2 = x * x;
    x = x*(C + x2*(B + A*x2));

    /* if x is in an odd pi count we must flip */
    if (k % 2) x = -x;

    return x;
}


Now, that may not seem very clear, with magic constants all over the place; I will explain them later. Note the two macros at the top: define BIG_ENDIAN if you're targeting a big-endian system, and if you cannot assume IEEE754 or a double-precision-mode FPU, define NO_FAST_TRUNCATE (this slows the method down).

Alright, the explanation! First I will explain how I calculate the sine, with some high-school math. I got the idea from this devmaster thread. If you draw the sine function, you can see that in the range [-PI/2, PI/2] it looks like a regular fifth-degree polynomial. To approximate sin(x) I chose the formula [font=courier new,courier,monospace]f(x) = ax^5 + bx^3 + cx[/font]. To get a good approximation I set up a system of equations from known points of sin(x): I set f(pi/2) to 1, f(pi/6) to 1/2, and the derivative f'(pi/2) to 0 (a zero derivative means the curve is "flat" at that point, which is exactly what we want at the peak). This results in the following equation system and solution:

Equations:
f(x) = ax^5 + bx^3 + cx
f(pi/2) = 1
f'(pi/2) = 0
f(pi/6) = 1/2
Solution:
a = 9 / (4 * pi^5)
b = -41 / (8 * pi^3)
c = 201 / (64 * pi)


However, we still have a major problem: this is only usable in the range [-PI/2, PI/2] and not outside it! That could be solved with a modulus, but that is way too expensive. Instead I divide x by PI (or rather, multiply by 1/PI) and round the result to an integer k. Then I subtract k times PI from the original x, which simulates the modulo.

We have two more problems. If x is in the range [PI/2, PI], sin(x) is negative, and so far we haven't accounted for that. The easiest solution is to reuse the integer calculated above: if k is odd, we are in a negated half of the sine wave and must flip the result.

Our last problem is rounding a double to an integer. In my original code I found that 50% of the time was spent in the [font=courier new,courier,monospace]round[/font] function. After searching the internet for a while I found this page by Sree Kotay with a solution. And it works beautifully, solving the final problem.

Another great part is that it doesn't rely on the standard library at all. This makes it very usable for demos and the like (which often don't link the standard library to save code size).

And finally, a "cool" version with every comment and all the helper stuff removed. This version only works in a little-endian, IEEE754 environment with a double-precision FPU (which is pretty much the default Windows environment), but the way it's written it is very easily converted to assembler:

double fast_sin(double x) {
    int k;
    double y;
    double z;

    z = x;
    z *= 0.3183098861837907;
    z += 6755399441055744.0;
    k = *((int *) &z);
    z = k;
    z *= 3.1415926535897932;
    x -= z;
    y = x;
    y *= x;
    z = 0.0073524681968701;
    z *= y;
    z -= 0.1652891139701474;
    z *= y;
    z += 0.9996919862959676;
    x *= z;
    k &= 1;
    k += k;
    z = k;
    z *= x;
    x -= z;

    return x;
}

As you say, your approximation does not work for the full period. I may have made some errors in my translation of the code, but the results are negated outside the range from -pi/2 to +pi/2. If these discontinuities are not supposed to be there, then the polynomial approximation is not suitable for a whole period anyway, since the approximation diverges as the input approaches -pi and +pi. I'm unsure whether you refer to the discontinuities or the divergence. For a full period, these coefficients are optimal in the least-squares sense:
[source]
A = 0.005600988812781;
B = -0.154891261471231;
C = 0.987233286863364;
[/source]
That is, these coefficients are the ones that minimize the average of the squared error, giving a uniformly small error across the whole period of the sine function.

edit: I just realized that the discontinuity is because you wrap k and x to the range [-pi/2, pi/2]. If you wrap to [-pi, pi] and use my coefficients instead, you can use the same code and the approximation is valid for the whole period. But in the end, the approximation is nothing more than a standard polynomial approximation.

I just profiled this and I'm getting std::sin to be about the same as your fast_sin.

Profile results:


--------+----------------------------------------------
Samples | Line of code
43 | float fast = static_cast<float>(fast_sin(d));
43 | float standard = std::sin(d);
|
395 | double fast = fast_sin(d);
402 | double standard = std::sin(d);
--------+----------------------------------------------


Profiler Settings:

/include:fast_sin /include:std::sin
Profiling collection: CPU sampling


Project Optimization Settings:

Release
Maximize Speed (/O2)
Enable Intrinsic Functions: Yes (/Oi)
Favor Size or Speed: Favor faster code (/Ot)
(everything else are defaults for Visual Studio Release build)


Testing Set up:

Visual Studio 2010 Ultimate
Microsoft Windows 7 64-bit
Intel Core 2 Duo 2.66 GHz
MacBook Pro5,1
No other applications were open


Code used in profiling:


#include <iostream>
#include <cmath>

double fast_sin(double x) {
    int k;
    double y;
    double z;

    z = x;
    z *= 0.3183098861837907;
    z += 6755399441055744.0;
    k = *((int *) &z);
    z = k;
    z *= 3.1415926535897932;
    x -= z;
    y = x;
    y *= x;
    z = 0.0073524681968701;
    z *= y;
    z -= 0.1652891139701474;
    z *= y;
    z += 0.9996919862959676;
    x *= z;
    k &= 1;
    k += k;
    z = k;
    z *= x;
    x -= z;

    return x;
}

void runTestDouble()
{
    double max = 0.0;
    double start = -10000.0;
    double end = 10000.0;
    double step = 0.00001;
    double d = start;

    while (d < end)
    {
        double fast = fast_sin(d);
        double standard = std::sin(d);
        double error = std::abs(fast - standard);
        max = std::max(max, error);
        d += step;
    }

    std::cout << "Double max error: " << max << std::endl;
}

void runTestFloat()
{
    float max = 0.0f;
    float start = -900.0f;
    float end = 900.0f;
    float step = 0.001f;
    float d = start;

    while (d < end)
    {
        float fast = static_cast<float>(fast_sin(d));
        float standard = std::sin(d);
        float error = std::abs(fast - standard);
        max = std::max(max, error);
        d += step;
    }

    std::cout << "Float max error: " << max << std::endl;
}

int main()
{
    runTestDouble();
    for (int i = 0; i < 100; ++i)
    {
        runTestFloat();
    }
}


Program results:

Double max error: 0.000296046
Float max error: 0.000296056


Yes, the other parts of the code dominate the execution time/CPU samples (especially std::abs()), but I don't see a significant difference between the two functions, and certainly not a factor of ~8.3.


[quote]I just profiled this and I'm getting std::sin to be about the same as your fast_sin. ... I don't see a significant difference between the two functions, and certainly not a factor of ~8.3.[/quote]

That is weird.

I profiled with gcc -O3 under MinGW. It might be that glibc's sin is incredibly slow compared to VC's; I don't know.

Either way, this is how I profiled (it uses stopwatch from https://github.com/n...src/stopwatch.c):


#include <stdio.h>
#include <math.h>
#include "commonc/stopwatch.h"

double fast_sin(double x) {
    int k;
    double y;
    double z;

    z = x;
    z *= 0.3183098861837907;
    z += 6755399441055744.0;
    k = *((int *) &z);
    z = k;
    z *= 3.1415926535897932;
    x -= z;
    y = x;
    y *= x;
    z = 0.0073524681968701;
    z *= y;
    z -= 0.1652891139701474;
    z *= y;
    z += 0.9996919862959676;
    x *= z;
    k &= 1;
    k += k;
    z = k;
    z *= x;
    x -= z;

    return x;
}

int main(int argc, char **argv) {
    volatile double prevent_opti;
    int i;
    double d;
    cc_stopwatch_t timer;

    /* cc_sin test */
    printf("Testing cc_sin\n");

    /* performance check */
    {
        cc_stopwatch_start(&timer);
        for (i = 0; i < 10000; i++) {
            for (d = -16*M_PI; d < 16*M_PI; d += 0.1) {
                prevent_opti = fast_sin(d);
            }
        }
        printf("fast_sin time: %f\n", cc_stopwatch_gettime(&timer));

        cc_stopwatch_start(&timer);
        for (i = 0; i < 10000; i++) {
            for (d = -16*M_PI; d < 16*M_PI; d += 0.1) {
                prevent_opti = sin(d);
            }
        }
        printf("sin time: %f\n", cc_stopwatch_gettime(&timer));
    }
    (void) prevent_opti;
    return 0;
}


These are the results I get (consistently, not just once):

fast_sin time: 147.236370
sin time: 1252.330147


I used GCC 4.6.1, 64-bit Windows 7 SP1, and a 2.2GHz single-core Celeron.

To get maximum speed out of gcc, my experience is that in addition to -O3 you should also try -ffast-math and appropriate SSE settings (-msse3, -msse4 or even -mavx). This obviously throws strict IEEE compliance out of the window, but then you are not really after that when using fast approximations ;).

Edit:
A quick test on my Atom N550 netbook with "-O3 -ffast-math -msse3" gives me about a factor 1.8 speedup for your version.

Edit2:
gcc-4.5 "-O3 -msse3 -ffast-math" on Atom N550 @ 1.5GHz:
[source]
fast sin time: 794.911000 ms
sin time: 1424.415000 ms
[/source]

icc (12.0.4) "-fast" on Atom N550 @ 1.5GHz:
[source]
fast sin time: 574.267149 ms
sin time: 745.954990 ms
[/source]

Comparing times between compilers might not be appropriate, since the benchmark potentially also measures some loop overhead (-funroll-loops doesn't change the timings on gcc, though).

linux compilable version:
[source]
#include <stdio.h>
#include <math.h>
#include <sys/time.h>

double mysecond()
{
    struct timeval tp;
    struct timezone tzp;

    gettimeofday(&tp, &tzp);
    return ((double) tp.tv_sec + (double) tp.tv_usec * 1.e-6);
}

double fast_sin(double x) {
    int k;
    double y;
    double z;

    z = x;
    z *= 0.3183098861837907;
    z += 6755399441055744.0;
    k = *((int *) &z);
    z = k;
    z *= 3.1415926535897932;
    x -= z;
    y = x;
    y *= x;
    z = 0.0073524681968701;
    z *= y;
    z -= 0.1652891139701474;
    z *= y;
    z += 0.9996919862959676;
    x *= z;
    k &= 1;
    k += k;
    z = k;
    z *= x;
    x -= z;

    return x;
}

int main(int argc, char **argv) {
    volatile double prevent_opti;
    int i;
    double d;
    double start, end;

    /* cc_sin test */
    printf("Testing cc_sin\n");

    /* performance check */
    {
        start = mysecond();
        for (i = 0; i < 10000; i++) {
            for (d = -16*M_PI; d < 16*M_PI; d += 0.1) {
                prevent_opti = fast_sin(d);
            }
        }
        end = mysecond();

        printf("fast sin time: %f ms\n", (end-start)*1.e3);

        start = mysecond();
        for (i = 0; i < 10000; i++) {
            for (d = -16*M_PI; d < 16*M_PI; d += 0.1) {
                prevent_opti = sinf(d); /* note: this loop calls sinf, not sin */
            }
        }
        end = mysecond();

        printf("sin time: %f ms\n", (end-start)*1.e3);
    }
    (void) prevent_opti;
    return 0;
}

[/source]

[quote]To get maximum speed out of gcc ... also try -ffast-math and appropriate SSE settings (-msse3, -msse4 or even -mavx). ... A quick test on my Atom N550 netbook gives me about factor 1.8 speedup for your version.[/quote]

-ffast-math brought the time of sin down to that of sinf, but that's it (it did nothing for my function). The SSE settings did nothing.

Cool, thanks for posting. Could be useful.

I can't help but point out, though, that the fastest [font=courier new,courier,monospace]sin [/font]in the world is to just represent angles as unit vectors -- in which case "sine" just means retrieving the second element. And if you need the sine of the angle between two vectors, the determinant (in 2d) or cross product (in 3d) gives you that value for the cost of a handful of ADDs and MULs.

Similarly, if you need a whole sequence of [font=courier new,courier,monospace]sin[/font] values (e.g., you want to generate a sine wave), then repeatedly rotating a unit vector can do what you need very efficiently.

The only non-primitive operation that you cannot avoid is the reciprocal square root.

Still, it's nice to have a full toolbox. So, thanks.

If you're going to do polynomial approximations, you're better off using Chebyshev polynomials. Though I really don't see the point of this. Are there any plausible applications that both support fast floating-point math and have trig functions as a bottleneck?
