Faster Sin and Cos

Started by L. Spiro; 37 comments, last by alvaro 7 years, 7 months ago

If you search for these terms you will get a lengthy list of possibilities.
Look-up tables from the old id Tech days are too cache-miss heavy.
Other implementations, including the Unreal Engine 4 one, are only faster if you are re-implementing sincos().
Sony presented an idea using Chebyshev polynomials, but there you are bound to a certain degree of accuracy based on how far out you expand the series.

With the goal of making something both fast and accurate in mind, I have come up with the following functions.


#include <cstdint>	// int32_t (the original's int32 is presumably a typedef for this).

float Sin( float _fX ) {
	// Range reduction: count the half-turns of PI and bring _fX into [-PI, PI].
	int32_t i32I = int32_t( _fX * 0.31830988618379067153776752674503f );	// 1 / PI.
	_fX = (_fX - float( i32I ) * 3.1415926535897932384626433832795f);

	float fX2 = _fX * _fX;

	// Odd half-turns lie on the descending half of the curve, so negate.
	return (i32I & 1) ?
		-_fX * (float( 1.0 ) +
			fX2 * (float( -1.66666671633720398e-01 ) +
			fX2 * (float( 8.33333376795053482e-03 ) +
			fX2 * (float( -1.98412497411482036e-04 ) +
			fX2 * (float( 2.75565571428160183e-06 ) +
			fX2 * (float( -2.50368437093584362e-08 ) +
			fX2 * (float( 1.58846852338356825e-10 ) +
			fX2 * float( -6.57978446033657960e-13 )))))))) :
		_fX * (float( 1.0 ) +
			fX2 * (float( -1.66666671633720398e-01 ) +
			fX2 * (float( 8.33333376795053482e-03 ) +
			fX2 * (float( -1.98412497411482036e-04 ) +
			fX2 * (float( 2.75565571428160183e-06 ) +
			fX2 * (float( -2.50368437093584362e-08 ) +
			fX2 * (float( 1.58846852338356825e-10 ) +
			fX2 * float( -6.57978446033657960e-13 ))))))));
}

float Cos( float _fX ) {
	// Same range reduction as in Sin().
	int32_t i32I = int32_t( _fX * 0.31830988618379067153776752674503f );	// 1 / PI.
	_fX = (_fX - float( i32I ) * 3.1415926535897932384626433832795f);

	float fX2 = _fX * _fX;

	// Odd half-turns flip the sign of the whole polynomial.
	return (i32I & 1) ?
		-float( 1.0 ) -
			fX2 * (float( -5.00000000000000000e-01 ) +
			fX2 * (float( 4.16666641831398010e-02 ) +
			fX2 * (float( -1.38888671062886715e-03 ) +
			fX2 * (float( 2.48006836045533419e-05 ) +
			fX2 * (float( -2.75369188784679864e-07 ) +
			fX2 * (float( 2.06202765973273472e-09 ) +
			fX2 * float( -9.77589970779790818e-12 ))))))) :
		float( 1.0 ) +
			fX2 * (float( -5.00000000000000000e-01 ) +
			fX2 * (float( 4.16666641831398010e-02 ) +
			fX2 * (float( -1.38888671062886715e-03 ) +
			fX2 * (float( 2.48006836045533419e-05 ) +
			fX2 * (float( -2.75369188784679864e-07 ) +
			fX2 * (float( 2.06202765973273472e-09 ) +
			fX2 * float( -9.77589970779790818e-12 )))))));
}

Performance on PC may vary, from 1.02 times as fast to 2.0 times as fast.
On PlayStation 4 this is around 7 or 8 times as fast.
On Xbox One this is around 2 or 3 times as fast.
Accuracy is no fewer than 6 decimal digits. I implemented a less-accurate version on Final Fantasy XV, so these versions are entirely suitable for any AAA production.

Explanation:

The code starts off by taking the input modulo PI (the equivalent of fmodf( _fX, PI )), implemented manually via a cast to an integer. This gives it a valid input range of ±52,707,130.87185.

cos() and sin() are curves that go up, then down, then up, etc. “i32I & 1” checks whether we are on an up curve or a down curve: i32I counts how many multiples of PI fit into the input, with even counts going one way and odd counts going the other.
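To make the reduction concrete, here is a small standalone trace (my own sketch, not part of the original functions) for an input of 10 radians:

#include <cstdint>
#include <cstdio>
#include <cmath>

int main() {
	float fX = 10.0f;
	int32_t i32I = int32_t( fX * 0.31830988618379067153776752674503f );	// i32I = 3: three half-turns of PI fit into 10.
	float fR = fX - float( i32I ) * 3.1415926535897932384626433832795f;	// fR = 0.57522f, now inside [-PI, PI].
	// i32I is odd, so we are on a descending half-cycle and the polynomial result is negated.
	printf( "sin(10) = %f, -sin(r) = %f\n", std::sin( fX ), -std::sin( fR ) );	// Both print -0.544021.
	return 0;
}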

Here is the fancy part.

You will notice that the magic constants start off being close to the reciprocals of the odd factorials (for sin) and the even factorials (for cos).

0.16666666666666666666666666666667 = 1 / 6 (6 = 1*2*3)

0.00833333333333333333333333333333 = 1 / 120 (120 = 1*2*3*4*5)

etc.
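As a quick sanity check (my own sketch, not part of the original post), printing the reciprocal odd factorials makes the comparison with the tuned constants easy:

#include <cstdio>

int main() {
	// Print 1/(2k+1)! for the eight odd factorials used by the sin() series.
	double fFact = 1.0;
	int iN = 1;
	for ( int iK = 1; iK <= 7; ++iK ) {
		fFact *= ++iN;	// Multiply in 2k and 2k+1 (iN becomes 2, 3, then 4, 5, ...).
		fFact *= ++iN;
		printf( "1/%d! = %.17e\n", iN, 1.0 / fFact );
	}
	return 0;
}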

But by the end, they drift rather significantly.

1/15! = 7.6471637318198164759011319857881e-13

I use

6.57978446033657960e-13

The reason is that the series should normally continue on into infinity, but we cut it short.

In the case of Sin(), if we don’t account for this, our numbers drift low (because we actually use the negative of the constant and it overshoots low).

Using a lower number as I have done accounts for this.

I’ve adjusted each of the constants to account for this type of drift and give the best-possible accuracy for this degree of precision.

The precision here is enough to entirely drive AAA games such as Final Fantasy XV and Star Ocean 5.

Later I will re-evaluate the constants used in Unreal Engine 4, and then I will post a super-fast version.

L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid


You know about this paper?

http://krisgarrett.net/papers/l2approx.pdf

You can pick one of a few ranges and a desired accuracy, and it gives you a polynomial for it.

It doesn't look like an officially published journal paper, so I don't know how good it is.

I was not aware of that paper.

Their constants produced results slightly better than my own, so I modified my constants to produce results significantly better than theirs. These might be the best possible constants.

#include <cstdint>	// int32_t.

static const float PI = 3.1415926535897932384626433832795f;	// Assumed definition; the post leaves PI implicit.

float Sin( float _fX ) {
	int32_t i32I = int32_t( _fX * (1.0f / PI) );
	_fX = (_fX - float( i32I ) * PI);

	float fX2 = _fX * _fX;

	return (i32I & 1) ?
		-_fX * (float( 1.00000000000000000000e+00 ) +
			fX2 * (float( -1.66666671633720397949e-01 ) +
			fX2 * (float( 8.33333376795053482056e-03 ) +
			fX2 * (float( -1.98412497411482036114e-04 ) +
			fX2 * (float( 2.75565571428160183132e-06 ) +
			fX2 * (float( -2.50368472620721149724e-08 ) +
			fX2 * (float( 1.58849267073435385100e-10 ) +
			fX2 * float( -6.58925550841432672300e-13 )))))))) :
		_fX * (float( 1.00000000000000000000e+00 ) +
			fX2 * (float( -1.66666671633720397949e-01 ) +
			fX2 * (float( 8.33333376795053482056e-03 ) +
			fX2 * (float( -1.98412497411482036114e-04 ) +
			fX2 * (float( 2.75565571428160183132e-06 ) +
			fX2 * (float( -2.50368472620721149724e-08 ) +
			fX2 * (float( 1.58849267073435385100e-10 ) +
			fX2 * float( -6.58925550841432672300e-13 ))))))));
}

float Cos( float _fX ) {
	int32_t i32I = int32_t( _fX * (1.0f / PI) );
	_fX = (_fX - float( i32I ) * PI);

	float fX2 = _fX * _fX;

	return (i32I & 1) ?
		float( -1.00000000000000000000e+00 ) -
			fX2 * (float( -5.00000000000000000000e-01 ) +
			fX2 * (float( 4.16666641831398010254e-02 ) +
			fX2 * (float( -1.38888671062886714935e-03 ) +
			fX2 * (float( 2.48006890615215525031e-05 ) +
			fX2 * (float( -2.75369927749125054106e-07 ) +
			fX2 * (float( 2.06207229069832465029e-09 ) +
			fX2 * float( -9.77507137733812925262e-12 ))))))) :
		float( 1.00000000000000000000e+00 ) +
			fX2 * (float( -5.00000000000000000000e-01 ) +
			fX2 * (float( 4.16666641831398010254e-02 ) +
			fX2 * (float( -1.38888671062886714935e-03 ) +
			fX2 * (float( 2.48006890615215525031e-05 ) +
			fX2 * (float( -2.75369927749125054106e-07 ) +
			fX2 * (float( 2.06207229069832465029e-09 ) +
			fX2 * float( -9.77507137733812925262e-12 )))))));
}

My error metric is not based on the maximum error as theirs is, but rather the total error over the whole range from -PI to PI, and my sampling rate is 12 times denser than theirs.
By this metric my new values reduce the error by approximately 34%.
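For reference, a total-error metric of this kind can be computed with a loop shaped roughly like the following (my own sketch; the exact sample count is an assumption, since the post only says the sampling is 12 times denser):

#include <cmath>
#include <cstdio>

float Sin( float _fX );	// The approximation under test, defined as above.

int main() {
	const double dPI = 3.14159265358979323846;
	const int iSamples = 12000000;	// Assumed density.
	double dTotal = 0.0, dMax = 0.0;
	for ( int iI = 0; iI <= iSamples; ++iI ) {
		double dX = -dPI + (2.0 * dPI) * iI / iSamples;
		double dErr = std::fabs( double( Sin( float( dX ) ) ) - std::sin( dX ) );
		dTotal += dErr;
		if ( dErr > dMax ) { dMax = dErr; }
	}
	printf( "total = %.17e, max = %.17e\n", dTotal, dMax );
	return 0;
}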

These numbers look significantly different from theirs partly because these numbers are fitted to floats (I just print 20 decimal places so that I can easily switch between float and double).
Their 8-constant cosine begins with 9.99999999919365479957e-01 and 4.99999998886526927002e-01, but these are actually exactly 1.0f and 0.5f when cast to float.
However, for the last constant on sin(), they have chosen 6.58075489175121657026e-13, and I have chosen 6.58925550841432672300e-13, which is significantly different.
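You can check the 1.0f/0.5f claim directly (a minimal sketch of mine):

#include <cstdio>

int main() {
	// The paper's two leading cosine constants collapse to exact floats when rounded.
	float fA = float( 9.99999999919365479957e-01 );
	float fB = float( 4.99999998886526927002e-01 );
	printf( "%d %d\n", fA == 1.0f, fB == 0.5f );	// Prints "1 1".
	return 0;
}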


When we get into doubles, their values are much more accurate, which suggests they have not fitted their values to floats at all.
Nonetheless, I was still able to find constants that provide a more-accurate result. I will post these later.


L. Spiro


You can make those functions significantly faster (at least on PC) by rearranging the expressions to cut down on dependencies between instructions. This lets the CPU pipeline work more efficiently.

The downside is that you lose a little accuracy. You may be able to get some of that back by tweaking the constants and/or the bracketing of the adds.

Here's what I ended up with:


#include <cstdint>	// int32_t.

static const float PI = 3.14159265358979323846f;	// Assumed definition; the post does not show where PI comes from.

float Sin(float x)
{
    int32_t i32I = int32_t( x * (1.0f / PI) );
    x = (x - float( i32I ) * PI);

    float fX2 = x * x;		// Precompute even powers so that independent
    float fX4 = fX2 * fX2;	// terms can issue in parallel on the CPU.
    float fX6 = fX2 * fX4;
    float fX8 = fX4 * fX4;
    float fX10 = fX6 * fX4;
    float fX12 = fX6 * fX6;
    float fX14 = fX6 * fX8;

    return (i32I & 1) ?
        -x * (float( 1.00000000000000000000e+00 ) +
            (fX2 * float( -1.66666671633720397949e-01 )) +
            ((fX4 * float( 8.33333376795053482056e-03 )) +
            (fX6 * float( -1.98412497411482036114e-04 ))) +
            ((fX8 * float( 2.75565571428160183132e-06 )) +
            (fX10 * float( -2.50368472620721149724e-08 ))) +
            ((fX12 * float( 1.58849267073435385100e-10 )) +
            (fX14 * float( -6.58925550841432672300e-13 )))
                ):
        x * (float( 1.00000000000000000000e+00 ) +
            (fX2 * float( -1.66666671633720397949e-01 )) +
            ((fX4 * float( 8.33333376795053482056e-03 )) +
            (fX6 * float( -1.98412497411482036114e-04 ))) +
            ((fX8 * float( 2.75565571428160183132e-06 )) +
            (fX10 * float( -2.50368472620721149724e-08 ))) +
            ((fX12 * float( 1.58849267073435385100e-10 )) +
            (fX14 * float( -6.58925550841432672300e-13 )))
                );
}

float Cos(float x)
{
    int32_t i32I = int32_t( x * (1.0f / PI) );
    x = (x - float( i32I ) * PI);

    float fX2 = x * x;
    float fX4 = fX2 * fX2;
    float fX6 = fX2 * fX4;
    float fX8 = fX4 * fX4;
    float fX10 = fX6 * fX4;
    float fX12 = fX6 * fX6;
    float fX14 = fX6 * fX8;

    return (i32I & 1) ?
        float( -1.00000000000000000000e+00 ) - (
            (fX2 * float( -5.00000000000000000000e-01 )) +
            ((fX4 * float( 4.16666641831398010254e-02 )) +
            (fX6 * float( -1.38888671062886714935e-03 ))) +
            ((fX8 * float( 2.48006890615215525031e-05 )) +
            (fX10 * float( -2.75369927749125054106e-07 ))) +
            ((fX12 * float( 2.06207229069832465029e-09 )) +
            (fX14 * float( -9.77507137733812925262e-12 )))
                ) :
        float( 1.00000000000000000000e+00 ) + (
            (fX2 * float( -5.00000000000000000000e-01 )) +
            ((fX4 * float( 4.16666641831398010254e-02 )) +
            (fX6 * float( -1.38888671062886714935e-03 ))) +
            ((fX8 * float( 2.48006890615215525031e-05 )) +
            (fX10 * float( -2.75369927749125054106e-07 ))) +
            ((fX12 * float( 2.06207229069832465029e-09 )) +
            (fX14 * float( -9.77507137733812925262e-12 )))
        );
}

I tested this in a VS 2015 x64 release build. YMMV.

Hi

I'm curious, how do you do the fitting/checking? Do you use something like Mathematica? Or another software you can recommend? Or do you use some test routines you write yourself?

I'm curious, how do you do the fitting/checking? [...]

I wrote my own tests. Mathematica and other tools are good for getting a result that is mathematically near-exact, but once you feed those numbers into a computer you get different values.
Fitting the values to floats or doubles means adjusting the numbers to account for the truncation that happens when converting to floats or doubles.
Accounting for this requires custom tools.
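To give an idea of what such a custom tool might look like (my sketch, not L. Spiro's actual code; TotalError() is a hypothetical helper that sums the approximation's absolute error against a double-precision reference), one simple approach is coordinate descent over neighboring representable floats:

#include <cmath>	// std::nextafterf, INFINITY.

// Hypothetical helper: evaluates the approximation with the given coefficients
// and returns its total absolute error against a double-precision reference.
double TotalError( const float pfCoeffs[8] );

// Nudge each coefficient to its adjacent representable floats, keeping any
// change that lowers the total error; repeat until no nudge helps.
void FitToFloats( float pfCoeffs[8] ) {
	bool bImproved = true;
	while ( bImproved ) {
		bImproved = false;
		for ( int iI = 0; iI < 8; ++iI ) {
			float fBest = pfCoeffs[iI];
			double dBestErr = TotalError( pfCoeffs );
			const float fCands[2] = { std::nextafterf( fBest, -INFINITY ),
				std::nextafterf( fBest, INFINITY ) };
			for ( float fCand : fCands ) {
				pfCoeffs[iI] = fCand;
				double dErr = TotalError( pfCoeffs );
				if ( dErr < dBestErr ) { dBestErr = dErr; fBest = fCand; bImproved = true; }
			}
			pfCoeffs[iI] = fBest;
		}
	}
}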


Later I will describe how I check the accuracy and will post all of the results as well as benchmark tests.
I will additionally benchmark Adam_42’s version, as well as one formatted the way they show in the paper linked by Alberth.


L. Spiro


Interesting endeavour, but I'll stay with Garrett's implementation (see post by Alberth). It is good enough, within 1 digit of float precision, and "proven", and it is consistently the fastest in my tests (22.5 times faster than the C library).

My results, if anyone is interested:


-O3 -march=skylake -mfpmath=sse:
sin               : 37470663    [0.120873]
sin_garrett_c11   : 1585131     [0.005113]
sin_garrett_c11_s : 2178746     [0.007028]
sin_spiro_s       : 3010460     [0.009711]
sin_spiro         : 2755805     [0.008890]
sin_adam42_s      : 5332894     [0.017203]
sin_adam42        : 5332894     [0.017203]
ratio: 23.6388431

with -ffast-math also:
sin               : 37378012    [0.120574]
sin_garrett_c11   : 1585772     [0.005115]
sin_garrett_c11_s : 2179679     [0.007031]
sin_spiro_s       : 3010939     [0.009713]
sin_spiro         : 2783895     [0.008980]
sin_adam42_s      : 5320730     [0.017164]
sin_adam42        : 5320730     [0.017164]
ratio: 23.570861

with -mfpmath=387 instead:
sin               : 36788379    [0.118672]
sin_garrett_c11   : 1633470     [0.005269]
sin_garrett_c11_s : 2140569     [0.006905]
sin_spiro_s       : 2981539     [0.009618]
sin_spiro         : 2742520     [0.008847]
sin_adam42_s      : 5453327     [0.017591]
sin_adam42        : 5453327     [0.017591]
ratio: 22.521613

with -mfpmath=sse,387 instead:
sin               : 36832914    [0.118816]
sin_garrett_c11   : 1636284     [0.005278]
sin_garrett_c11_s : 2141416     [0.006908]
sin_spiro_s       : 2982003     [0.009619]
sin_spiro         : 2740198     [0.008839]
sin_adam42_s      : 5444734     [0.017564]
sin_adam42        : 5444734     [0.017564]
ratio: 22.510098

You can make those functions significantly faster [...] I tested this in a VS 2015 x64 release build. YMMV.

This one is funny. On my machine, with GCC 6.1 (64-bit), it takes about twice as long. I wonder how it can be so different.

You are comparing his 11th-degree version to my 15th-degree version, which will naturally be slower.
I would only be interested in comparing his 15th-degree to my 15th-degree. I will derive better 11th-degree constants later—these take time.
The performance should be exactly equal (unless re-ordering the expression matters) but provide a tighter fit (more accurate).


L. Spiro


You are comparing his 11th-degree version to my 15th-degree version [...] The performance should be exactly equal (unless re-ordering the expression matters) but provide a tighter fit (more accurate).

Yes, that's true. Running from -PI to +PI in a million steps, and looking at the difference to the C lib function, we get:


sin                : emax=0.000000000000000     eavg=0.000000000000000  sse=0.000000000000000  // unsurprising
sin_garrett_c11    : emax=0.000000291691886     eavg=0.000000051244958  sse=0.000000003505368
sin_garrett_c11_s  : emax=0.000000472727176     eavg=0.000000058468439  sse=0.000000005403125
sin_spiro_s        : emax=0.000000350934626     eavg=0.000000040066997  sse=0.000000003403369
sin_spiro          : emax=0.000000019798559     eavg=0.000000008708427  sse=0.000000000126031
sin_adam42_s       : emax=0.000000592055687     eavg=0.000000044807770  sse=0.000000004318299
sin_adam42         : emax=0.000000019798559     eavg=0.000000008708427  sse=0.000000000126031

That's the double-precision version of your function which is the clear winner (the one you posted is single precision). But the double-precision version runs faster than the single-precision one anyway. Interestingly, there are no observable rounding differences between yours and Adam_42's version; I would have expected some, since they perform the operations in a different order, so the results should differ very slightly.
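The expectation that reordered operations differ comes from float addition not being associative, as this tiny standalone demo (my own, not from the thread) shows:

#include <cstdio>

int main() {
	// Floating-point addition is not associative, so different bracketings
	// of the same polynomial can round differently.
	float fA = 1.0e8f, fB = -1.0e8f, fC = 1.0f;
	printf( "%f vs %f\n", (fA + fB) + fC, fA + (fB + fC) );	// Prints "1.000000 vs 0.000000".
	return 0;
}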

It will be interesting to see how well the C11 version fares. The question is which metric is most important; I'm almost inclined to think "max error".

I can see L. Spiro is doing a good job of getting fast approximations to sin and cos, but I wonder why this is an important problem. What are people doing (in particular in games) that makes the calls to sin and cos take a noticeable chunk of time?


It would be nice if the results were never larger than one (e.g., Sin(1.57083237171173f) gives me 1.00000011920929f). Can the coefficient optimization be constrained by that?
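A cheap workaround at the call site (my sketch, not something proposed in the thread; SinClamped is a hypothetical wrapper) would be to clamp the polynomial's output:

#include <algorithm>	// std::min, std::max.

float Sin( float _fX );	// The approximation from earlier in the thread.

// Hypothetical wrapper guaranteeing the result stays within [-1, 1].
inline float SinClamped( float _fX ) {
	return std::max( -1.0f, std::min( 1.0f, Sin( _fX ) ) );
}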

EDIT: This forum is too smart for its own good. I am making two unrelated comments and I am trying to post them in two separate posts, but the forum combines them. Annoying.

