Faster Sin and Cos

I'm curious, how do you do the fitting/checking? Do you use something like Mathematica? Or another software you can recommend? Or do you use some test routines you write yourself?

I wrote my own tests. Mathematica and other tools are good for getting a result that is mathematically near-exact, but once you feed those numbers into a computer you get different values.
Fitting the values to floats or doubles means adjusting the numbers to account for the truncation that happens when converting to floats or doubles.
Accounting for this requires custom tools.
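For illustration, a minimal sketch of that truncation, using Garrett's first sine constant (quoted later in this thread). The double that comes out of the fitting tool and the float the routine actually stores are different numbers, and the fit has to be judged against the latter:

#include <cstdio>

int main() {
    const double dCoef = 9.99999707044156546685e-01;    // Garrett's published double.
    const float fCoef = static_cast<float>( dCoef );    // What a single-precision routine actually stores.
    std::printf( "double: %.20e\nfloat : %.20e\n", dCoef, double( fCoef ) );
    return 0;
}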


Later I will describe how I check the accuracy, and I will post all of the results as well as benchmark tests.
I will additionally benchmark Adam_42's version, as well as versions formatted the way they are shown in the paper linked by Alberth.


L. Spiro


Interesting endeavour, but I'll stay with Garrett's implementation (see the post by Alberth). It is good enough (within 1 digit of float precision), it is "proven", and it is consistently the fastest in my tests (22.5 times faster than the C library).

 

 

My results, if anyone is interested:

-O3 -march=skylake -mfpmath=sse:
sin               : 37470663    [0.120873]
sin_garrett_c11   : 1585131     [0.005113]
sin_garrett_c11_s : 2178746     [0.007028]
sin_spiro_s       : 3010460     [0.009711]
sin_spiro         : 2755805     [0.008890]
sin_adam42_s      : 5332894     [0.017203]
sin_adam42        : 5332894     [0.017203]
ratio: 23.6388431

with -ffast-math also:
sin               : 37378012    [0.120574]
sin_garrett_c11   : 1585772     [0.005115]
sin_garrett_c11_s : 2179679     [0.007031]
sin_spiro_s       : 3010939     [0.009713]
sin_spiro         : 2783895     [0.008980]
sin_adam42_s      : 5320730     [0.017164]
sin_adam42        : 5320730     [0.017164]
ratio: 23.570861

with -mfpmath=387 instead:
sin               : 36788379    [0.118672]
sin_garrett_c11   : 1633470     [0.005269]
sin_garrett_c11_s : 2140569     [0.006905]
sin_spiro_s       : 2981539     [0.009618]
sin_spiro         : 2742520     [0.008847]
sin_adam42_s      : 5453327     [0.017591]
sin_adam42        : 5453327     [0.017591]
ratio: 22.521613

with -mfpmath=sse,387 instead:
sin               : 36832914    [0.118816]
sin_garrett_c11   : 1636284     [0.005278]
sin_garrett_c11_s : 2141416     [0.006908]
sin_spiro_s       : 2982003     [0.009619]
sin_spiro         : 2740198     [0.008839]
sin_adam42_s      : 5444734     [0.017564]
sin_adam42        : 5444734     [0.017564]
ratio: 22.510098

You can make those functions significantly faster [...] I tested this in a VS 2015 x64 release build. YMMV.

This one is funny. On my machine, with GCC 6.1 (64-bit), it takes about twice as long. I wonder how it can be so different.

 

You are comparing his 11th-degree version to my 15th-degree version, which will naturally be slower.
I would only be interested in comparing his 15th-degree to my 15th-degree. I will derive better 11th-degree constants later; these take time.
The performance should be exactly equal (unless reordering the expression matters) but the fit tighter (more accurate).


L. Spiro


You are comparing his 11th-degree version to my 15th-degree version, which will naturally be slower.
I would only be interested in comparing his 15th-degree to my 15th-degree. I will derive better 11th-degree constants later; these take time.
The performance should be exactly equal (unless reordering the expression matters) but the fit tighter (more accurate).


L. Spiro

 

Yes, that's true. Running from -PI to +PI in a million steps and looking at the difference from the C library function, we get:

sin                : emax=0.000000000000000     eavg=0.000000000000000  sse=0.000000000000000  // unsurprising
sin_garrett_c11    : emax=0.000000291691886     eavg=0.000000051244958  sse=0.000000003505368
sin_garrett_c11_s  : emax=0.000000472727176     eavg=0.000000058468439  sse=0.000000005403125
sin_spiro_s        : emax=0.000000350934626     eavg=0.000000040066997  sse=0.000000003403369
sin_spiro          : emax=0.000000019798559     eavg=0.000000008708427  sse=0.000000000126031
sin_adam42_s       : emax=0.000000592055687     eavg=0.000000044807770  sse=0.000000004318299
sin_adam42         : emax=0.000000019798559     eavg=0.000000008708427  sse=0.000000000126031

It's the double-precision version of your function that is the clear winner (the one you posted is single precision). But the double-precision version runs faster than the single-precision version anyway. Interestingly, there are no observable rounding differences between your version and Adam_42's; I would have expected some, since they perform the operations in a different order, so the results should differ very slightly.

 

It will be interesting to see how well the C11 version fares. The question is which metric is most important; I'm almost inclined to think "max error".
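For anyone wanting to reproduce these columns, a minimal sketch of such a harness; sin_approx is a hypothetical stand-in for whichever implementation is under test:

#include <cmath>
#include <cstdio>

double sin_approx( double x );  // Stand-in for the implementation under test.

void measure() {
    const double PI = 3.14159265358979323846;
    const int iSteps = 1000000;
    double dMax = 0.0, dSum = 0.0, dSse = 0.0;
    for ( int I = 0; I <= iSteps; ++I ) {
        double dX = -PI + (2.0 * PI) * I / iSteps;
        double dErr = std::fabs( sin_approx( dX ) - std::sin( dX ) );
        if ( dErr > dMax ) { dMax = dErr; }
        dSum += dErr;          // For the average error.
        dSse += dErr * dErr;   // Sum of squared errors.
    }
    std::printf( "emax=%.15f eavg=%.15f sse=%.15f\n", dMax, dSum / (iSteps + 1), dSse );
}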

Edited by samoth


I can see L. Spiro is doing a good job of getting fast approximations to sin and cos, but I wonder why this is an important problem. What are people doing (in particular in games) that makes the calls to sin and cos take a noticeable chunk of time?
 


It would be nice if the results were never larger than one (e.g., Sin(1.57083237171173f) gives me 1.00000011920929f). Can the coefficient optimization be constrained by that?

 

EDIT: This forum is too smart for its own good. I am making two unrelated comments and I am trying to post them in two separate posts, but the forum combines them. Annoying.

Edited by Álvaro


What are people doing (in particular in games) that makes the calls to sin and cos take a noticeable chunk of time?

 

Calculating object rotations/orientations.


 

What are people doing (in particular in games) that makes the calls to sin and cos take a noticeable chunk of time?

Calculating object rotations/orientations.

 

Seeing how one billion invocations take slightly under 0.4 seconds on my machine, this seems to be a very serious problem! If you have only 1ms of time budget for all your sin/cos calculations, and you can't do parallel work, you can draw no more than 2.5 million objects!

But of course, doing some pointless optimizations is never bad, I'm all for that. Faster is always better :)

Joke aside, this here:
 
 

It would be nice if the results were never larger than one

indeed seems to be less pronounced in Garrett's implementation. In fact, I only see it in Spiro's single-precision version; the double-precision version seems to be immune to it. On one billion samples between -Pi and Pi, Spiro's single precision has 380 hits, Adam_42's single precision even has 2030 (why???), and all others have zero hits.

 

So, as long as you use the double version, you should be good: no values greater than one. (You could probably solve the equation to see exactly which inputs, if any, produce outputs greater than one, but I'm too lazy for that and too many years outta school... leaving that as homework for someone else, hehehe. A billion samples and zero hits is good enough for me!)

Edited by samoth


It would be nice if the results were never larger than one (e.g., Sin(1.57083237171173f) gives me 1.00000011920929f). Can the coefficient optimization be constrained by that?

All implementations of sin()/cos() on digital machinery are approximations, so a result slightly above 1.0 should never be a case that causes a problem, particularly in real-time situations.
The simple act of building a unit vector from the standard sinf() and cosf() (or even the double-precision sin() and cos()) creates vectors with a length greater than 1.0 approximately half of the time.
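A quick sketch of how to check that claim (comparing squared lengths, so no sqrt is involved):

#include <cmath>
#include <cstdio>

void count_overshoots() {
    const int iSamples = 1000000;
    int iHits = 0;
    for ( int I = 0; I < iSamples; ++I ) {
        float fA = -3.14159265f + 6.28318531f * float( I ) / float( iSamples );
        float fS = std::sin( fA );  // Float overloads of the standard functions.
        float fC = std::cos( fA );
        if ( fS * fS + fC * fC > 1.0f ) { ++iHits; }
    }
    std::printf( "%d of %d \"unit\" vectors have length > 1\n", iHits, iSamples );
}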


Although I could add a further constraint to my evaluations so that I determine the most accurate constants that never produce a value above 1.0, the point of these approximations is to be less strict in order to gain speed.

Eventually I will get to the 21st-degree polynomials. I will add the extra constraint then, because by that point you are more concerned with accuracy than performance; 21st-degree polynomials tend to perform worse than his paper suggests on multiple hardware targets, including the current consoles.

Meanwhile, although my float values were evaluated with double-precision arithmetic, I have determined double constants that provide an even better approximation for a 15th-degree polynomial.
I will post those soon.


L. Spiro


What are people doing (in particular in games) that makes the calls to sin and cos take a noticeable chunk of time?

 
Calculating object rotations/orientations.


Do people really do that? 3D rotations are easily encoded using unit-length quaternions, and 2D rotations using unit-length complex numbers. If you use those, you basically don't need any trigonometric functions for rotations/orientations, just multiply-adds. I thought that's what everyone would be doing.
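To make that concrete, a minimal 2D sketch: with a rotation stored as a unit-length complex number, composing rotations and rotating points are pure multiply-adds, and sin/cos is only needed if you start from an angle in the first place:

struct Rot2 { float c, s; };    // Assumed unit length: c*c + s*s == 1.

// Compose two rotations (the angle-addition formulas): no trig calls.
inline Rot2 Compose( Rot2 a, Rot2 b ) {
    return { a.c * b.c - a.s * b.s, a.s * b.c + a.c * b.s };
}

// Rotate a 2D point by r.
inline void Rotate( Rot2 r, float &x, float &y ) {
    float nx = r.c * x - r.s * y;
    float ny = r.s * x + r.c * y;
    x = nx; y = ny;
}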


But the double-precision version runs faster than the single-precision version anyway

With what compiler settings?
x86 compilers tend to produce poor float code unless you use the fast-math option to opt out of IEEE 32-bit strictness.

With any setting, funnily. Complete timings with and without -ffast-math (also with SSE math, 387 math, and the combined mode) are 7 posts above. Double is faster every time.
Must be that all calculations are done in double anyway (they certainly are for 387, though I would have assumed differently for SSE), and single precision forces an explicit conversion somewhere. Whatever it is, we're talking about 7.5% on an operation in the single-nanosecond range, so really no biggie.

It might be more visible on a single call. Maybe it's worth RDTSCing a single call to see how many cycles go into it. I assume there is some really hefty pipelining involved in that benchmark, because if you divide the time measured by QPC by the number of iterations, you get around 5-6 clock cycles for the fastest method, which is obviously much faster than this chain of multiplications and additions can possibly be. Unless QPC is lying to me... but the wall clock says something very similar. So: very deep pipelining.

But in a real application where it matters (because you do hundreds of thousands of calls in a batch), you will have that very same pipelining, so I guess the benchmark is not "wrong" as such. If you only have to calculate a handful of sines, who cares anyway.
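A rough sketch of that RDTSC experiment for GCC/Clang on x86 (sin_spiro is a stand-in declaration; RDTSCP waits for earlier instructions to finish, but single-shot readings are still noisy, so take the number with a grain of salt):

#include <x86intrin.h>
#include <cstdio>

double sin_spiro( double x );   // Stand-in for the function under test.

void time_one_call( double t ) {
    unsigned int uiAux;
    unsigned long long ullStart = __rdtscp( &uiAux );
    volatile double dSink = sin_spiro( t );    // volatile keeps the call from being optimized away.
    unsigned long long ullEnd = __rdtscp( &uiAux );
    (void)dSink;
    std::printf( "~%llu cycles\n", ullEnd - ullStart );
}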

What's the critical path? (I'm far too lazy right now.) There's a lot of parallelizable multiplication there. If it's flooding the available multipliers, then it's likely that near-future hardware (with more mul units) will give a greater improvement.

...

How many multipliers do modern cores have now, anyway? Or are FPUs monolithic?
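To make the critical-path question concrete, here is a sketch contrasting the Horner form used in this thread with Estrin's scheme, which shortens the dependency chain at the cost of an extra multiply (c[0]..c[5] are placeholder coefficients):

// Horner: each step depends on the previous one, so the chain is 5 FMAs deep.
float HornerOdd( float x2, const float c[6] ) {
    return c[0] + x2 * (c[1] + x2 * (c[2] + x2 * (c[3] + x2 * (c[4] + x2 * c[5]))));
}

// Estrin: the three coefficient pairs are independent and can issue in parallel.
float EstrinOdd( float x2, const float c[6] ) {
    float x4 = x2 * x2;              // Also independent of the pairs below.
    float p01 = c[0] + x2 * c[1];
    float p23 = c[2] + x2 * c[3];
    float p45 = c[4] + x2 * c[5];
    return p01 + x4 * (p23 + x4 * p45);    // Depth of roughly 3 instead of 5.
}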


One question: why put all those muls/adds in that last branch if they seem to be the same for both paths, except for the "- x *" and "x *" parts? Or am I missing something here?


A couple of updates. First, on why double precision is faster than single...

vmulss %xmm0,%xmm0,%xmm1
vcvtss2sd %xmm0,%xmm0,%xmm0
vcvtss2sd %xmm1,%xmm1,%xmm1
vfmadd213sd 0x8b3b(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b3a(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b39(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b38(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b37(%rip),%xmm2,%xmm1
vmulsd %xmm0,%xmm1,%xmm0
vcvtsd2ss %xmm0,%xmm0,%xmm0  <-------- aha!

Yep, that's why. Both do the same thing, except for the last line.

 

Second, having looked at the disassembly, I'm shocked that GCC doesn't inline a single one of these functions! It actually does function calls! Why?
Well, I'm using a function that takes a functor, like this:

template<typename F> auto test(F f, /* blah blah */)
{
    ...
    qpc.start();
    for(...)
        sum += f(t);
    qpc.stop();

    volatile double discard = sum;  // Keep the accumulated result alive past the optimizer.

    return qpc;
}

The assumption was that, of course, the compiler would inline that function pointer into a simple three-liner, since it's being called on a constant expression, which the compiler can see immediately, just as it can see that the function is trivial and easily inlineable. Guess what: it doesn't. However, changing the call site to a lambda works just fine. Tell me about being illogical. Grrr...
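A sketch of the difference, with the extra arguments to test() elided as above:

double sin_spiro( double x );   // Stand-in for one of the functions under test.

auto slow = test( sin_spiro /*, ... */ );                                  // F deduced as a function pointer; GCC 6.1 emits an indirect call.
auto fast = test( []( double t ) { return sin_spiro( t ); } /*, ... */ );  // The lambda's type is unique and its body visible, so it inlines.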

 

On the positive side, this means that the custom functions are faster because they really are faster, not because the compiler inlines them while failing to inline the library call.

Now, trying to get GCC to auto-vectorize this over an array of 10k doubles doesn't seem to work. Even if you "help" it and manually unroll the loop 4 times, it just generates 4 copies of the scalar code. Oh well, it was worth trying.


One question: why put all those muls/adds in that last branch if they seem to be the same for both paths, except for the "- x *" and "x *" parts? Or am I missing something here?

 
If that's true, you could probably remove the branch entirely. Compute the polynomial once:

float fValue = _fX * {....};

Then:

float sign = (float( i32I & 1 ) * -2.0f) + 1.0f;
return sign * fValue;

One int-to-float cast, an add, and two multiplies.

Or, uhm:

int sign = 1 - ((i32I & 1) << 1);
return sign * fValue;

One (implicit) int-to-float conversion, a left shift, a subtraction, and one multiply.

It'd have to actually be profiled to see if saving the branch is worth it, of course.
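For completeness, a compilable sketch of the second variant with the shared polynomial hoisted out of the branch (fPoly stands for the common Horner chain, which both paths compute identically):

#include <cstdint>

inline float ApplySign( int32_t i32I, float _fX, float fPoly ) {
    int32_t i32Sign = 1 - ((i32I & 1) << 1);    // Even half-period: +1; odd: -1.
    return float( i32Sign ) * _fX * fPoly;      // No branch at all.
}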


So, for fun, I implemented a few sanity tests, such as: what is the value of sin(0) or sin(-PI), and are the functions monotonically rising/falling within the respective intervals?

 

Maybe my tests are whacky; I will have to look closer tomorrow morning... but... not a single implementation passes the tests :lol:

(including the C lib)
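For reference, a minimal sketch of such tests (sin_approx is a placeholder for the implementation under test). Note that exact equality at 0 and -PI is a strict criterion that approximations, and apparently even the C library here, can fail:

#include <cstdio>

double sin_approx( double x );  // Stand-in for the implementation under test.

void sanity() {
    const double PI = 3.14159265358979323846;
    std::printf( "sin(0)   = %.17g (want 0)\n", sin_approx( 0.0 ) );
    std::printf( "sin(-PI) = %.17g (want 0)\n", sin_approx( -PI ) );
    // Monotonically rising on [-PI/2, PI/2]: each sample should be >= the previous one.
    double dPrev = sin_approx( -PI / 2.0 );
    for ( int I = 1; I <= 1000; ++I ) {
        double dX = -PI / 2.0 + PI * I / 1000.0;
        double dY = sin_approx( dX );
        if ( dY < dPrev ) { std::printf( "not monotonic at x=%.9f\n", dX ); }
        dPrev = dY;
    }
}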

Since the 11th-degree sine and 10th-degree cosine are popular (used by Unreal and Garrett), I rederived them.
 
#include <cstdint>	// For int32_t.

float Sin_11( float _fX ) {
	int32_t i32I = int32_t( _fX * 0.31830988618379067153776752674503f );	// 1 / PI.
	_fX = (_fX - float( i32I ) * 3.1415926535897932384626433832795f);

	float fX2 = _fX * _fX;
	// Average error:
	//	0.000000071991304556528727213541666666667
	// Max error:
	//	0.000000543892383575439453125
	return (i32I & 1) ?
		-_fX * (float( 9.99999701976776123047e-01 ) +
			fX2 * (float( -1.66665777564048767090e-01 ) +
			fX2 * (float( 8.33255797624588012695e-03 ) +
			fX2 * (float( -1.98125766473822295666e-04 ) +
			fX2 * (float( 2.70405212177138309926e-06 ) +
			fX2 * float( -2.05329886426852681325e-08 )))))) :
		_fX * (float( 9.99999701976776123047e-01 ) +
			fX2 * (float( -1.66665777564048767090e-01 ) +
			fX2 * (float( 8.33255797624588012695e-03 ) +
			fX2 * (float( -1.98125766473822295666e-04 ) +
			fX2 * (float( 2.70405212177138309926e-06 ) +
			fX2 * float( -2.05329886426852681325e-08 ))))));
}
float Cos_10( float _fX ) {
	int32_t i32I = int32_t( _fX * 0.31830988618379067153776752674503f );	// 1 / PI.
	_fX = (_fX - float( i32I ) * 3.1415926535897932384626433832795f);

	float fX2 = _fX * _fX;

	// Average error:
	//	0.00000084443936666666666666666666666667
	// Max error:
	//	0.00000244379044
	return (i32I & 1) ?
		float( -9.99999463558197021484e-01 ) -
			fX2 * (float( -4.99995589256286621094e-01 ) +
			fX2 * (float( 4.16610352694988250732e-02 ) +
			fX2 * (float( -1.38627504929900169373e-03 ) +
			fX2 * (float( 2.42532332777045667171e-05 ) +
			fX2 * float( -2.21941789391166821588e-07 ))))) :
		float( 9.99999463558197021484e-01 ) +
			fX2 * (float( -4.99995589256286621094e-01 ) +
			fX2 * (float( 4.16610352694988250732e-02 ) +
			fX2 * (float( -1.38627504929900169373e-03 ) +
			fX2 * (float( 2.42532332777045667171e-05 ) +
			fX2 * float( -2.21941789391166821588e-07 )))));
}
Error is derived from 12,000,000 samples from -PI to PI. Here are the results.
Sin
My values have a lower average error and a lower max error. I also added the constraint that they never generate values above 1.
 

Unreal Engine uses these constants:
1.0 0.16666667 0.0083333310 0.00019840874 2.7525562e-06 2.3889859e-08
Average error: 0.000011496102666666666666666666666667
Max error: 0.000187166996
 
Garrett uses these constants:
9.99999707044156546685e-01 1.66665772196961623983e-01 8.33255814755188010464e-03 1.98125763417806681909e-04 2.70405218307799040084e-06 2.05342856289746600727e-08
Average error: 0.000000086373766666666666666666666666667
Max error: 0.000000879168510

I use these constants:
9.99999701976776123047e-01 1.66665777564048767090e-01 8.33255797624588012695e-03 1.98125766473822295666e-04 2.70405212177138309926e-06 2.05329886426852681325e-08
Average error: 0.000000071991304583333333333333333333333
Max error: 0.000000543892384


The Unreal Engine 4 implementation is extremely inaccurate and branch-heavy. It also produces values above 1.
 
 
Cos
Garrett's constants produced values above 1, so I did not add that constraint to my new constants. In fact I changed only one constant, and only by a very tiny amount.


Unreal Engine uses these constants:
1.0 0.5 0.041666638 0.0013888378 2.4760495e-05 2.6051615e-07
Average error: 0.00010072299666666666666666666666667
Max error: 0.000762462616

Garrett uses these constants:
9.99999443739537210853e-1 4.99995582499065048420e-01 4.16610337354021107429e-02 1.38627507062573673756e-03 2.42532401381033027481e-05 2.21941782786353727022e-07
Average error: 0.00000084569900833333333333333333333333
Max error: 0.00000244379044

I use these constants:
9.99999463558197021484e-01 4.99995589256286621094e-01 4.16610352694988250732e-02 1.38627504929900169373e-03 2.42532332777045667171e-05 2.21941789391166821588e-07
Average error: 0.00000084443936666666666666666666666667
Max error: 0.00000244379044


Although all of my numbers look different from Garrett's, that is only because his numbers are given prior to casting to float.
After a cast to float, all of his numbers are the same as mine except the 4th, which I lowered very slightly.
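A quick way to check that is to round-trip one of his published doubles through float, shown here with the second sine constant from the lists above:

#include <cstdio>

int main() {
    const double dGarrett = 1.66665772196961623983e-01;    // As published by Garrett.
    std::printf( "%.20e\n", double( float( dGarrett ) ) );
    // Per the comparison above, this prints the same value as the
    // second constant in the "I use these constants" list.
    return 0;
}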


L. Spiro

Edited by L. Spiro


  If that's true, you could probably remove the branch entirely
Or rotate the "i32I & 1" bit right by one, into the sign-bit position, and use it to flip _fX's sign bit directly.

 

In any case, no branch is faster than the branch on my end (using doubles).
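A sketch of that bit trick; memcpy is used for the type pun to keep it well-defined:

#include <cstdint>
#include <cstring>

inline float FlipSign( float _fX, int32_t i32I ) {
    uint32_t ui32Bits;
    std::memcpy( &ui32Bits, &_fX, sizeof( ui32Bits ) );
    ui32Bits ^= uint32_t( i32I & 1 ) << 31;    // Odd half-period: flip the sign bit.
    std::memcpy( &_fX, &ui32Bits, sizeof( _fX ) );
    return _fX;
}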

I tested the branchless paths some time ago, because branchless should intuitively be the way to go.
They didn't give the performance I desired on consoles. I didn't investigate deeply why, because I was on the clock.

It will be one of my upcoming investigations. For now I am focusing on accuracy; next I will focus on performance.


L. Spiro

