L. Spiro 25641 Report post Posted August 28, 2016 (edited)

If you search for these terms you will get a lengthy list of possibilities. Look-up tables from the old id Tech days are too cache-miss heavy. Other implementations are only faster if you are re-implementing sincos(), including the Unreal Engine 4 implementation. Sony presents an idea using Chebyshev polynomials, but we are bound to certain degrees of accuracy based on how far out you wish to expand the Taylor series. With the goal of making something both fast and accurate in mind, I have come up with the following functions.

float Sin( float _fX ) {
    int32 i32I = int32( _fX * 0.31830988618379067153776752674503f );    // 1 / PI.
    _fX = (_fX - float( i32I ) * 3.1415926535897932384626433832795f);
    float fX2 = _fX * _fX;
    return (i32I & 1) ?
        -_fX * (float( 1.0 ) + fX2 * (float( -1.66666671633720398e-01 ) + fX2 * (float( 8.33333376795053482e-03 ) + fX2 * (float( -1.98412497411482036e-04 ) + fX2 * (float( 2.75565571428160183e-06 ) + fX2 * (float( -2.50368437093584362e-08 ) + fX2 * (float( 1.58846852338356825e-10 ) + fX2 * float( -6.57978446033657960e-13 )))))))) :
        _fX * (float( 1.0 ) + fX2 * (float( -1.66666671633720398e-01 ) + fX2 * (float( 8.33333376795053482e-03 ) + fX2 * (float( -1.98412497411482036e-04 ) + fX2 * (float( 2.75565571428160183e-06 ) + fX2 * (float( -2.50368437093584362e-08 ) + fX2 * (float( 1.58846852338356825e-10 ) + fX2 * float( -6.57978446033657960e-13 ))))))));
}

float Cos( float _fX ) {
    int32 i32I = int32( _fX * 0.31830988618379067153776752674503f );    // 1 / PI.
    _fX = (_fX - float( i32I ) * 3.1415926535897932384626433832795f);
    float fX2 = _fX * _fX;
    return (i32I & 1) ?
        -float( 1.0 ) - fX2 * (float( -5.00000000000000000e-01 ) + fX2 * (float( 4.16666641831398010e-02 ) + fX2 * (float( -1.38888671062886715e-03 ) + fX2 * (float( 2.48006836045533419e-05 ) + fX2 * (float( -2.75369188784679864e-07 ) + fX2 * (float( 2.06202765973273472e-09 ) + fX2 * float( -9.77589970779790818e-12 ))))))) :
        float( 1.0 ) + fX2 * (float( -5.00000000000000000e-01 ) + fX2 * (float( 4.16666641831398010e-02 ) + fX2 * (float( -1.38888671062886715e-03 ) + fX2 * (float( 2.48006836045533419e-05 ) + fX2 * (float( -2.75369188784679864e-07 ) + fX2 * (float( 2.06202765973273472e-09 ) + fX2 * float( -9.77589970779790818e-12 )))))));
}

Performance on PC may vary, from 1.02 times as fast to 2.0 times as fast. On PlayStation 4 this is around 7 or 8 times as fast. On Xbox One this is around 2 or 3 times as fast. Accuracy is no fewer than 6 digits. I implemented a less-accurate version on Final Fantasy XV, so these versions are entirely suitable for any AAA production.

Explanation: The code starts off by performing the equivalent of fmodf() on PI. This is implemented manually via a cast to an integer. This gives it a valid range of ±52,707,130.87185. cos() and sin() are curves that go up, then down, then up, etc. "i32I & 1" checks for it being an up curve or a down curve. i32I represents the number of PI denominators, each even one going one way, each odd one going the other way.

Here is the fancy part. You will notice that the magic constants start off being close to one over each odd factorial and each even factorial.
0.16666666666666666666666666666667 = 1 / 6 (6 = 1*2*3)
0.00833333333333333333333333333333 = 1 / 120 (120 = 1*2*3*4*5)
etc. But by the end, they drift rather significantly.
1/15! = 7.6471637318198164759011319857881e-13, whereas I use 6.57978446033657960e-13.
The reason is that the series should normally continue on into infinity, but we cut it short. In the case of Sin(), if we don't account for this, our numbers drift low (because we actually use the negative of the constant and it overshoots low).
Using a lower number, as I have done, accounts for this. I've adjusted each of the constants to account for this type of drift and give the best possible accuracy for this degree of precision. The precision here is enough to entirely drive AAA games such as Final Fantasy XV, Star Ocean 5, and others. Later I will re-evaluate the constants used in Unreal Engine 4, and then I will post a super-fast version.

L. Spiro

Edited August 28, 2016 by L. Spiro
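For readers who want to try the technique themselves, here is a minimal sketch of the two ideas in the post above: count half-turns with an integer cast, reduce the argument into (-PI, PI), and evaluate an odd polynomial in Horner form, flipping the sign on odd half-turns. The function name and the coefficients are mine, not L. Spiro's — plain 7th-degree Taylor coefficients are used for illustration, so the accuracy is far below the fitted constants above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Minimal sketch of the range-reduction idea: count half-turns with an
// integer cast, reduce x into roughly (-PI, PI), and flip the sign on odd
// half-turns. Plain Taylor coefficients are used here, not the fitted ones.
inline float SinSketch(float x) {
    const float kPi    = 3.14159265358979323846f;
    const float kInvPi = 0.31830988618379067154f;
    int32_t i = static_cast<int32_t>(x * kInvPi); // number of half-turns
    x -= static_cast<float>(i) * kPi;             // reduced argument
    float x2 = x * x;
    // Horner evaluation of x - x^3/3! + x^5/5! - x^7/7!.
    float p = x * (1.0f + x2 * (-1.0f / 6.0f
                 + x2 * (1.0f / 120.0f
                 + x2 * (-1.0f / 5040.0f))));
    return (i & 1) ? -p : p;                      // odd half-turn: sign flips
}
```

Swapping in the fitted constants from the post recovers the accuracy figures quoted there; the structure is identical.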
Alberth 9534 Report post Posted August 28, 2016 (edited)

Do you know about this paper? http://krisgarrett.net/papers/l2approx.pdf You pick one of a few ranges and a desired accuracy, and it gives you a polynomial for it. It doesn't look like an officially published journal paper, so I don't know how good it is.

Edited August 28, 2016 by Alberth
L. Spiro 25641 Report post Posted August 31, 2016

I was not aware of that paper. Their constants produced results slightly better than my own, so I modified my constants to produce results significantly better than theirs. These might be the best possible constants.

float Sin( float _fX ) {
    int32 i32I = int32( _fX * (1.0f / PI) );
    _fX = (_fX - float( i32I ) * PI);
    float fX2 = _fX * _fX;
    return (i32I & 1) ?
        -_fX * (float( 1.00000000000000000000e+00 ) + fX2 * (float( -1.66666671633720397949e-01 ) + fX2 * (float( 8.33333376795053482056e-03 ) + fX2 * (float( -1.98412497411482036114e-04 ) + fX2 * (float( 2.75565571428160183132e-06 ) + fX2 * (float( -2.50368472620721149724e-08 ) + fX2 * (float( 1.58849267073435385100e-10 ) + fX2 * float( -6.58925550841432672300e-13 )))))))) :
        _fX * (float( 1.00000000000000000000e+00 ) + fX2 * (float( -1.66666671633720397949e-01 ) + fX2 * (float( 8.33333376795053482056e-03 ) + fX2 * (float( -1.98412497411482036114e-04 ) + fX2 * (float( 2.75565571428160183132e-06 ) + fX2 * (float( -2.50368472620721149724e-08 ) + fX2 * (float( 1.58849267073435385100e-10 ) + fX2 * float( -6.58925550841432672300e-13 ))))))));
}

float Cos( float _fX ) {
    int32 i32I = int32( _fX * (1.0f / PI) );
    _fX = (_fX - float( i32I ) * PI);
    float fX2 = _fX * _fX;
    return (i32I & 1) ?
        float( -1.00000000000000000000e+00 ) - fX2 * (float( -5.00000000000000000000e-01 ) + fX2 * (float( 4.16666641831398010254e-02 ) + fX2 * (float( -1.38888671062886714935e-03 ) + fX2 * (float( 2.48006890615215525031e-05 ) + fX2 * (float( -2.75369927749125054106e-07 ) + fX2 * (float( 2.06207229069832465029e-09 ) + fX2 * float( -9.77507137733812925262e-12 ))))))) :
        float( 1.00000000000000000000e+00 ) + fX2 * (float( -5.00000000000000000000e-01 ) + fX2 * (float( 4.16666641831398010254e-02 ) + fX2 * (float( -1.38888671062886714935e-03 ) + fX2 * (float( 2.48006890615215525031e-05 ) + fX2 * (float( -2.75369927749125054106e-07 ) + fX2 * (float( 2.06207229069832465029e-09 ) + fX2 * float( -9.77507137733812925262e-12 )))))));
}

My error metric is not based on the maximum error as theirs is, but rather the total error over the whole range from -PI to PI, and my sampling rate is 12 times denser than theirs. By this metric my new values reduce the error by approximately 34%.

These numbers look significantly different from theirs partly because these numbers are fitted to floats (I just print 20 decimal places so that I can easily switch between float and double). Their 8-constant cosine begins with 9.99999999919365479957e-01 and 4.99999998886526927002e-01, but these are actually exactly 1.0f and 0.5f when cast to float. However, for the last constant on sin(), they have chosen 6.58075489175121657026e-13, and I have chosen 6.58925550841432672300e-13, which is significantly different.

When we get into doubles, their values are much more accurate, which suggests they have not fitted their values to floats at all. Nonetheless, I was still able to find constants that provide a more-accurate result. I will post these later.

L. Spiro
Adam_42 3630 Report post Posted September 1, 2016

You can make those functions significantly faster (at least on PC) by rearranging the expressions to cut down on dependencies between instructions. This lets the CPU pipeline work more efficiently. The downside is that you lose a little accuracy. You may be able to get some of that back by tweaking the constants, and/or the bracketing for the adds. Here's what I ended up with:

float Sin(float x) {
    int32_t i32I = int32_t( x * (1.0f / PI) );
    x = (x - float( i32I ) * PI);
    float fX2 = x * x;
    float fX4 = fX2 * fX2;
    float fX6 = fX2 * fX4;
    float fX8 = fX4 * fX4;
    float fX10 = fX6 * fX4;
    float fX12 = fX6 * fX6;
    float fX14 = fX6 * fX8;
    return (i32I & 1) ?
        -x * (float( 1.00000000000000000000e+00 ) +
            (fX2 * float( -1.66666671633720397949e-01 )) +
            ((fX4 * float( 8.33333376795053482056e-03 )) + (fX6 * float( -1.98412497411482036114e-04 ))) +
            ((fX8 * float( 2.75565571428160183132e-06 )) + (fX10 * float( -2.50368472620721149724e-08 ))) +
            ((fX12 * float( 1.58849267073435385100e-10 )) + (fX14 * float( -6.58925550841432672300e-13 ))) ) :
        x * (float( 1.00000000000000000000e+00 ) +
            (fX2 * float( -1.66666671633720397949e-01 )) +
            ((fX4 * float( 8.33333376795053482056e-03 )) + (fX6 * float( -1.98412497411482036114e-04 ))) +
            ((fX8 * float( 2.75565571428160183132e-06 )) + (fX10 * float( -2.50368472620721149724e-08 ))) +
            ((fX12 * float( 1.58849267073435385100e-10 )) + (fX14 * float( -6.58925550841432672300e-13 ))) );
}

float Cos(float x) {
    int32_t i32I = int32_t( x * (1.0f / PI) );
    x = (x - float( i32I ) * PI);
    float fX2 = x * x;
    float fX4 = fX2 * fX2;
    float fX6 = fX2 * fX4;
    float fX8 = fX4 * fX4;
    float fX10 = fX6 * fX4;
    float fX12 = fX6 * fX6;
    float fX14 = fX6 * fX8;
    return (i32I & 1) ?
        float( -1.00000000000000000000e+00 ) - (
            (fX2 * float( -5.00000000000000000000e-01 )) +
            ((fX4 * float( 4.16666641831398010254e-02 )) + (fX6 * float( -1.38888671062886714935e-03 ))) +
            ((fX8 * float( 2.48006890615215525031e-05 )) + (fX10 * float( -2.75369927749125054106e-07 ))) +
            ((fX12 * float( 2.06207229069832465029e-09 )) + (fX14 * float( -9.77507137733812925262e-12 ))) ) :
        float( 1.00000000000000000000e+00 ) + (
            (fX2 * float( -5.00000000000000000000e-01 )) +
            ((fX4 * float( 4.16666641831398010254e-02 )) + (fX6 * float( -1.38888671062886714935e-03 ))) +
            ((fX8 * float( 2.48006890615215525031e-05 )) + (fX10 * float( -2.75369927749125054106e-07 ))) +
            ((fX12 * float( 2.06207229069832465029e-09 )) + (fX14 * float( -9.77507137733812925262e-12 ))) );
}

I tested this in a VS 2015 x64 release build. YMMV.
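Adam_42's rearrangement is essentially Estrin's scheme: precompute even powers of x so that pairs of terms can be multiplied independently instead of forming one long Horner dependency chain. A generic sketch of the idea on an arbitrary degree-7 polynomial (the function name and coefficients are placeholders, not the fitted sine constants):

```cpp
#include <cassert>
#include <cmath>

// Estrin-style evaluation of c0 + c1*x + c2*x^2 + ... + c7*x^7.
// Pairing terms around precomputed powers shortens the dependency chain
// versus Horner's method; the FLOP count is similar, but more of the
// multiplies are independent and can issue in parallel.
inline float EstrinPoly7(float x, const float c[8]) {
    float x2 = x * x;
    float x4 = x2 * x2;
    float b0 = c[0] + c[1] * x;   // these four pairs are independent
    float b1 = c[2] + c[3] * x;
    float b2 = c[4] + c[5] * x;
    float b3 = c[6] + c[7] * x;
    float q0 = b0 + b1 * x2;
    float q1 = b2 + b3 * x2;
    return q0 + q1 * x4;
}
```

Whether this actually wins depends on how many multiply units the target core has and how well the compiler schedules the independent products, which matches the mixed benchmark results reported later in this thread.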
danybittel 374 Report post Posted September 2, 2016

Hi, I'm curious: how do you do the fitting/checking? Do you use something like Mathematica, or another software you can recommend? Or do you use some test routines you write yourself?
L. Spiro 25641 Report post Posted September 2, 2016

I'm curious, how do you do the fitting/checking? Do you use something like Mathematica? Or another software you can recommend? Or do you use some test routines you write yourself?

I wrote my own tests. Mathematica and other tools are good for getting a result mathematically near-exact, but once you feed those numbers into a computer you get different values. Fitting the values to floats or doubles means to adjust the numbers to account for truncation that happens when converting to floats or doubles. Accounting for this requires custom tools.

Later I will describe how I check the accuracy and will post all of the results as well as benchmark tests. I will additionally benchmark Adam_42's version as well as formatting the way they show in the paper linked by Alberth.

L. Spiro
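The "custom tools" idea can be sketched very simply: nudge a candidate coefficient up and down by one float ULP at a time and keep whichever direction lowers the summed error over a dense sample of the range. This is my own illustration of the general approach, not L. Spiro's actual tool; the toy model here tunes the single cubic coefficient of p(x) = x + c*x^3, whereas a real fitter would sweep every coefficient of the full polynomial.

```cpp
#include <cassert>
#include <cmath>

// Total absolute error of the toy model p(x) = x + c*x^3 against the C
// library sin() over [-PI, PI].
inline double TotalError(float c, int samples = 2000) {
    double err = 0.0;
    for (int i = 0; i < samples; ++i) {
        float x = -3.14159265f + 6.2831853f * (float(i) / float(samples - 1));
        float approx = x + c * x * x * x;
        err += std::fabs(double(approx) - std::sin(double(x)));
    }
    return err;
}

// "Fit to float": step the coefficient by single-ULP increments and keep
// the direction that lowers the total error; stop at a local minimum or
// after a step budget.
inline float FitCoefficient(float c, int maxSteps = 200) {
    double best = TotalError(c);
    for (int step = 0; step < maxSteps; ++step) {
        float up   = std::nextafterf(c, 1.0f);   // +1 ULP
        float down = std::nextafterf(c, -1.0f);  // -1 ULP
        double eUp = TotalError(up), eDown = TotalError(down);
        if (eUp < best)        { c = up;   best = eUp;   }
        else if (eDown < best) { c = down; best = eDown; }
        else break;                               // ULP-local minimum
    }
    return c;
}
```

The key point, matching the post above, is that the error is evaluated after every float rounding, so the fitted constant can differ from the mathematically exact minimizer.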
formerly_known_as_samoth 9827 Report post Posted September 2, 2016

Interesting endeavour, but I'll stay with Garrett's implementation (see post by Alberth). It is good enough (within 1 digit of float precision), it is "proven", and it is consistently the fastest in my tests (22.5 times faster than the C library). My results, if anyone is interested:

-O3 -march=skylake -fpmath=sse:
sin               : 37470663 [0.120873]
sin_garrett_c11   : 1585131  [0.005113]
sin_garrett_c11_s : 2178746  [0.007028]
sin_spiro_s       : 3010460  [0.009711]
sin_spiro         : 2755805  [0.008890]
sin_adam42_s      : 5332894  [0.017203]
sin_adam42        : 5332894  [0.017203]
ratio: 23.6388431

with --fast-math also:
sin               : 37378012 [0.120574]
sin_garrett_c11   : 1585772  [0.005115]
sin_garrett_c11_s : 2179679  [0.007031]
sin_spiro_s       : 3010939  [0.009713]
sin_spiro         : 2783895  [0.008980]
sin_adam42_s      : 5320730  [0.017164]
sin_adam42        : 5320730  [0.017164]
ratio: 23.570861

with -fpmath=387 instead:
sin               : 36788379 [0.118672]
sin_garrett_c11   : 1633470  [0.005269]
sin_garrett_c11_s : 2140569  [0.006905]
sin_spiro_s       : 2981539  [0.009618]
sin_spiro         : 2742520  [0.008847]
sin_adam42_s      : 5453327  [0.017591]
sin_adam42        : 5453327  [0.017591]
ratio: 22.521613

with -fpmath=sse,387 instead:
sin               : 36832914 [0.118816]
sin_garrett_c11   : 1636284  [0.005278]
sin_garrett_c11_s : 2141416  [0.006908]
sin_spiro_s       : 2982003  [0.009619]
sin_spiro         : 2740198  [0.008839]
sin_adam42_s      : 5444734  [0.017564]
sin_adam42        : 5444734  [0.017564]
ratio: 22.510098

You can make those functions significantly faster [...] I tested this in a VS 2015 x64 release build. YMMV.

This one is funny. On my machine, with GCC 6.1 (64-bit), it takes about twice as long. Wonder how it can be so different.
L. Spiro 25641 Report post Posted September 2, 2016

You are comparing his 11th-degree version to my 15th-degree version, which will naturally be slower. I would only be interested in comparing his 15th-degree to my 15th-degree. I will derive better 11th-degree constants later—these take time. The performance should be exactly equal (unless re-ordering the expression matters) but provide a tighter fit (more accurate).

L. Spiro
formerly_known_as_samoth 9827 Report post Posted September 2, 2016 (edited)

You are comparing his 11th-degree version to my 15th-degree version, which will naturally be slower. I would only be interested in comparing his 15th-degree to my 15th-degree. I will derive better 11th-degree constants later—these take time. The performance should be exactly equal (unless re-ordering the expression matters) but provide a tighter fit (more accurate). L. Spiro

Yes, that's true. Running from -PI to +PI in a million steps, and looking at the difference to the C lib function, we get:

sin               : emax=0.000000000000000 eavg=0.000000000000000 sse=0.000000000000000 // unsurprising
sin_garrett_c11   : emax=0.000000291691886 eavg=0.000000051244958 sse=0.000000003505368
sin_garrett_c11_s : emax=0.000000472727176 eavg=0.000000058468439 sse=0.000000005403125
sin_spiro_s       : emax=0.000000350934626 eavg=0.000000040066997 sse=0.000000003403369
sin_spiro         : emax=0.000000019798559 eavg=0.000000008708427 sse=0.000000000126031
sin_adam42_s      : emax=0.000000592055687 eavg=0.000000044807770 sse=0.000000004318299
sin_adam42        : emax=0.000000019798559 eavg=0.000000008708427 sse=0.000000000126031

That's the double-precision version of your function which is the clear winner (the one you posted is single precision). But the double-precision version runs faster than the single-precision version anyway. Interestingly, there are no observable rounding errors between yours and adam_42's version; I would have expected some -- after all, they perform operations ordered differently, so the results should differ very slightly.

Will be interesting to see how well the C11 version fares. The question is which metric is most important; I'm almost inclined to think "max error".

Edited September 2, 2016 by samoth
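The three columns above (max error, average error, sum of squared errors against the C library) are easy to reproduce with a small harness. This is my own reconstruction of the metrics, not samoth's actual code; the names are mine.

```cpp
#include <cassert>
#include <cmath>

struct ErrorStats { double emax, eavg, sse; };

// Compare an approximation f against the C library sin() over [-PI, PI],
// accumulating max error, average error, and sum of squared errors.
template <typename F>
ErrorStats Measure(F f, int samples = 100000) {
    const double kPi = 3.14159265358979323846;
    double emax = 0.0, esum = 0.0, sse = 0.0;
    for (int i = 0; i < samples; ++i) {
        double x = -kPi + 2.0 * kPi * (double(i) / double(samples - 1));
        double e = std::fabs(double(f(float(x))) - std::sin(x));
        if (e > emax) emax = e;
        esum += e;
        sse  += e * e;
    }
    return { emax, esum / samples, sse };
}
```

Note that even feeding the library sin() through this harness with a float argument shows a small nonzero error, which is exactly the float-truncation effect discussed earlier in the thread.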
alvaro 21272 Report post Posted September 2, 2016 (edited)

I can see L. Spiro is doing a good job of getting fast approximations to sin and cos, but I wonder why this is an important problem. What are people doing (in particular in games) that makes the calls to sin and cos take a noticeable chunk of time?

It would be nice if the results were never larger than one (e.g., Sin(1.57083237171173f) gives me 1.00000011920929f). Can the coefficient optimization be constrained by that?

EDIT: This forum is too smart for its own good. I am making two unrelated comments and I am trying to post them in two separate posts, but the forum combines them. Annoying.

Edited September 2, 2016 by Álvaro
Kylotan 10021 Report post Posted September 2, 2016

What are people doing (in particular in games) that makes the calls to sin and cos take a noticeable chunk of time?

Calculating object rotations/orientations.
formerly_known_as_samoth 9827 Report post Posted September 2, 2016 (edited)

What are people doing (in particular in games) that makes the calls to sin and cos take a noticeable chunk of time? Calculating object rotations/orientations.

Seeing how one billion invocations take slightly under 0.4 seconds on my machine, this seems to be a very serious problem! If you have only 1ms of time budget for all your sin/cos calculations, and you can't do parallel work, you can draw no more than 2.5 million objects! But of course, doing some pointless optimizations is never bad, I'm all for that. Faster is always better :)

Joke aside, this here:

It would be nice if the results were never larger than one

indeed seems to be less of an issue in Garrett's implementation. Although I only see it in Spiro's single-precision version; the double-precision version seems to be immune to it. On one billion samples between -Pi and Pi, Spiro single-precision has 380 hits, Adam_42 single-precision even has 2030 (why???), and all others have zero hits. So, as long as you use the double version, you should be good, no values greater than one (you can probably solve the equation to see exactly which values, if any, will produce outputs greater than one, but I'm too lazy for that and too many years outa school... leaving that as homework for someone else, hehehe... a billion samples and zero hits is good enough for me!).

Edited September 2, 2016 by samoth
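Counting those "hits" above 1.0 is a one-loop scan. Here is my own sketch of such a harness; since the fitted constants are elsewhere in the thread, a deliberately overshooting toy (the 5th-degree Taylor sine, which peaks slightly above 1 near pi/2) is used to demonstrate a nonzero count.

```cpp
#include <cassert>
#include <cmath>

// Scan [-PI, PI] and count samples where an approximation exceeds 1 in
// magnitude -- something the true sine never does.
template <typename F>
long CountAboveOne(F f, long samples = 1000000) {
    const double kPi = 3.14159265358979323846;
    long hits = 0;
    for (long i = 0; i < samples; ++i) {
        double x = -kPi + 2.0 * kPi * (double(i) / double(samples - 1));
        if (std::fabs(f(float(x))) > 1.0f) ++hits;
    }
    return hits;
}

// 5th-degree Taylor sine: x - x^3/3! + x^5/5!. It overshoots 1 slightly
// around pi/2 (about 1.0045 at the peak), so it registers hits.
inline float TaylorSin5(float x) {
    float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f / 6.0f + x2 * (1.0f / 120.0f)));
}
```

Dense sampling like this finds overshoot regions but cannot prove their absence; an exhaustive sweep over all float bit patterns in the range would, and is feasible in single precision.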
Hodgman 51421 Report post Posted September 2, 2016

But double precision runs faster than single precision version, anyway

With what compiler settings? x86 compilers tend to produce poor float code unless you use the fast-math option to opt out of IEEE 32-bit strictness.
L. Spiro 25641 Report post Posted September 3, 2016

It would be nice if the results were never larger than one (e.g., Sin(1.57083237171173f) gives me 1.00000011920929f). Can the coefficient optimization be constrained by that?

All implementations of sin()/cos() on digital machinery are approximations, and in practice being slightly above 1.0 does not cause problems, particularly in real-time situations. The simple act of normalizing a vector using the standard sinf() and cosf() (or even the double-sized sin() and cos()) functions creates vectors with a length greater than 1.0 approximately half of the time. Although I could add a further constraint to my evaluations such that I determine the most accurate constants that never result in a value above 1.0, the point of these approximations is that you are not so strict, so that you can gain speed.

Eventually I will get to the 21st-degree polynomials. I will add an extra constraint then, because by that point you are more concerned with accuracy than performance, as 21st-degree polynomials tend to perform more poorly than his paper suggests on multiple hardwares, including the current consoles.

Meanwhile, although my float values have been evaluated in double-sized algorithms, I have determined double constants that provide an even better approximation for a 15th-degree polynomial. I will post those soon.

L. Spiro
alvaro 21272 Report post Posted September 3, 2016

What are people doing (in particular in games) that makes the calls to sin and cos take a noticeable chunk of time? Calculating object rotations/orientations.

Do people really do that? 3D rotations are easily encoded using unit-length quaternions, and 2D rotations using unit-length complex numbers. If you use those, you basically don't need any trigonometric functions for rotations/orientations, just multiply-adds. I thought that's what everyone would be doing.
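alvaro's point can be made concrete in 2D: once a rotation is stored as a unit-length complex number (cos a, sin a), composing rotations and rotating points needs only multiply-adds; sin/cos are evaluated once, when the angle is first turned into this form. A minimal sketch (my naming):

```cpp
#include <cassert>
#include <cmath>

// A 2D rotation stored as a unit complex number (c, s) = (cos a, sin a).
// After construction, composing rotations and applying them to points
// uses only multiply-adds -- no trig calls.
struct Rot2 {
    float c, s;
    static Rot2 FromAngle(float a) { return { std::cos(a), std::sin(a) }; }
    Rot2 operator*(const Rot2& r) const {        // complex multiplication
        return { c * r.c - s * r.s, c * r.s + s * r.c };
    }
    void Apply(float& x, float& y) const {       // rotate the point (x, y)
        float nx = c * x - s * y;
        float ny = s * x + c * y;
        x = nx; y = ny;
    }
};
```

The 3D quaternion case is analogous: composition and point rotation are pure multiply-adds, and trig only appears when converting an angle to or from the representation.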
Khatharr 8812 Report post Posted September 3, 2016

Someone put this blanket in the dryer.
formerly_known_as_samoth 9827 Report post Posted September 3, 2016

But double precision runs faster than single precision version, anyway

With what compiler settings? x86 compilers tend to produce poor float code unless you use the fast-math option to opt out of IEEE 32-bit strictness.

With any setting, funnily. Complete timings with and without --fast-math (also with SSE and 387 math, and combi-mode) are 7 posts above. Double is faster every time. Must be that all calculations are double anyway (they sure are for 387, but I would have assumed differently for SSE), and single precision forces an explicit conversion somewhere. Whatever it is, we're talking like 7.5% on an operation in the single-nanosecond range, so really no biggie. Might be more visible on a single call. Maybe it's worth RDTSCing a single call to see how many cycles go into it.

I assume there is some really hefty pipelining involved in that benchmark, because if you break down the time measured by QPC and the number of iterations, you get a number that's around 5-6 or so clock cycles for the fastest method. Which is obviously much faster than this chain of multiplications and additions can possibly be. Unless QPC is lying to me... but the wall clock says something very similar. Or very deep pipelining. But in a real application where it matters (because you do hundreds of thousands of calls in a batch), you will have that very same pipelining, so I guess the benchmark is not "wrong" as such. If you only have to calculate a handful of sines, who cares anyway.
Khatharr 8812 Report post Posted September 3, 2016

What's the critical path? (I'm far too lazy rn.) There's a lot of parallelizable multiplication there. If it's flooding the available multipliers then it's likely that near-future hardware (with more mul units) will give a greater improvement. ... How many multipliers do modern cores have now anyway? Or are FPUs monolithic?
TheChubu 9459 Report post Posted September 3, 2016

One question: why put all those muls/adds in that last branch if they seem to be the same for both paths (except for the "- x *" and "x *" parts)? Or am I missing something here?
formerly_known_as_samoth 9827 Report post Posted September 3, 2016

A couple of updates. First, on why double precision is faster than single...

vmulss      %xmm0,%xmm0,%xmm1
vcvtss2sd   %xmm0,%xmm0,%xmm0
vcvtss2sd   %xmm1,%xmm1,%xmm1
vfmadd213sd 0x8b3b(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b3a(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b39(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b38(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b37(%rip),%xmm2,%xmm1
vmulsd      %xmm0,%xmm1,%xmm0
vcvtsd2ss   %xmm0,%xmm0,%xmm0   <-------- aha!

Yep, that's why. Both do the same, except the last line.

Second, having looked at the disassembly now, I'm shocked that GCC doesn't even inline a single one of these functions! It actually does function calls! Why? Well, I'm using a function that takes a functor like this:

template<typename F>
auto test(F f, [blah blah]) {
    ...
    qpc.start();
    for(...) sum += f(t);
    qpc.stop();
    volatile double discard = sum;
    return qpc;
}

The assumption is that, of course, the compiler will inline that function pointer to a simple three-liner since it's being called on a darn constant expression. Which, of course, the compiler can see immediately, just like it can see immediately that the function is rather trivial and easily inlineable. Guess what, it doesn't. However, changing the function call to a lambda works just fine. Tell me about being illogical. Grrr...

On the positive side, this means that the custom functions are really faster because they're faster, not because the compiler inlines them and doesn't inline the library call. Now, trying to get GCC to auto-vectorize this over an array of 10k doubles doesn't seem to work. Even if you "help" it and manually unroll the loop 4 times, it just generates 4 times the scalar code. Oh well, it was worth trying.
Servant of the Lord 33722 Report post Posted September 3, 2016

One question: why put all those muls/adds in that last branch if they seem to be the same for both paths (except for the "- x *" and "x *" parts)? Or am I missing something here?

If that's true, you could probably remove the branch entirely:

fX2 * {....};
float sign = ((float(i32I & 1) * -2.0f) + 1.0f);
return (sign * fX2);

One int-to-float cast, an add, and two multiplies. Or, uhm:

int sign = 1 - ((i32I & 1) << 1);
return (sign * fX2);

One (implicit) int-to-float cast, a left-shift, a subtraction, and one multiply. It'd have to actually be profiled to see if saving the branch is worth it, of course.
formerly_known_as_samoth 9827 Report post Posted September 3, 2016

So I implemented a few sanity tests for fun, such as what's the value of sin(0) or sin(-PI), or whether the functions are monotonically rising/falling within the respective intervals. Maybe my tests are whacky, will have to look closer tomorrow morning... but... not a single implementation passes the tests :lol: (including the C lib)
L. Spiro 25641 Report post Posted September 3, 2016 (edited)

Since 11th-degree sine and 10th-degree cosine are popular (used by Unreal and Garrett), I rederived them.

float Sin_11( float _fX ) {
    int32_t i32I = int32_t( _fX * 0.31830988618379067153776752674503f );    // 1 / PI.
    _fX = (_fX - float( i32I ) * 3.1415926535897932384626433832795f);
    float fX2 = _fX * _fX;
    // Average error:
    //  0.000000071991304556528727213541666666667
    // Max error:
    //  0.000000543892383575439453125
    return (i32I & 1) ?
        -_fX * (float( 9.99999701976776123047e-01 ) + fX2 * (float( -1.66665777564048767090e-01 ) + fX2 * (float( 8.33255797624588012695e-03 ) + fX2 * (float( -1.98125766473822295666e-04 ) + fX2 * (float( 2.70405212177138309926e-06 ) + fX2 * float( -2.05329886426852681325e-08 )))))) :
        _fX * (float( 9.99999701976776123047e-01 ) + fX2 * (float( -1.66665777564048767090e-01 ) + fX2 * (float( 8.33255797624588012695e-03 ) + fX2 * (float( -1.98125766473822295666e-04 ) + fX2 * (float( 2.70405212177138309926e-06 ) + fX2 * float( -2.05329886426852681325e-08 ))))));
}

float Cos_10( float _fX ) {
    int32_t i32I = int32_t( _fX * 0.31830988618379067153776752674503f );    // 1 / PI.
    _fX = (_fX - float( i32I ) * 3.1415926535897932384626433832795f);
    float fX2 = _fX * _fX;
    // Average error:
    //  0.00000084443936666666666666666666666667
    // Max error:
    //  0.00000244379044
    return (i32I & 1) ?
        float( -9.99999463558197021484e-01 ) - fX2 * (float( -4.99995589256286621094e-01 ) + fX2 * (float( 4.16610352694988250732e-02 ) + fX2 * (float( -1.38627504929900169373e-03 ) + fX2 * (float( 2.42532332777045667171e-05 ) + fX2 * float( -2.21941789391166821588e-07 ))))) :
        float( 9.99999463558197021484e-01 ) + fX2 * (float( -4.99995589256286621094e-01 ) + fX2 * (float( 4.16610352694988250732e-02 ) + fX2 * (float( -1.38627504929900169373e-03 ) + fX2 * (float( 2.42532332777045667171e-05 ) + fX2 * float( -2.21941789391166821588e-07 )))));
}

Error is derived from 12,000,000 samples from -PI to PI.
Here are the results.

Sin
I derived these values with lower averages and lower max errors. I also added the constraint that it never generates values above 1.

Unreal Engine uses these constants:
1.0, 0.16666667, 0.0083333310, 0.00019840874, 2.7525562e-06, 2.3889859e-08
Average error: 0.000011496102666666666666666666666667
Max error: 0.000187166996

Garrett uses these constants:
9.99999707044156546685e-01, 1.66665772196961623983e-01, 8.33255814755188010464e-03, 1.98125763417806681909e-04, 2.70405218307799040084e-06, 2.05342856289746600727e-08
Average error: 0.000000086373766666666666666666666666667
Max error: 0.000000879168510

I use these constants:
9.99999701976776123047e-01, 1.66665777564048767090e-01, 8.33255797624588012695e-03, 1.98125766473822295666e-04, 2.70405212177138309926e-06, 2.05329886426852681325e-08
Average error: 0.000000071991304583333333333333333333333
Max error: 0.000000543892384

The Unreal Engine 4 implementation is extremely inaccurate and branch-heavy. It also produces values above 1.

Cos
Garrett's constants produced values above 1, so I did not add that constraint to my new constants.
And in fact I only changed one constant, and only by a very tiny amount.

Unreal Engine uses these constants:
1.0, 0.5, 0.041666638, 0.0013888378, 2.4760495e-05, 2.6051615e-07
Average error: 0.00010072299666666666666666666666667
Max error: 0.000762462616

Garrett uses these constants:
9.99999443739537210853e-1, 4.99995582499065048420e-01, 4.16610337354021107429e-02, 1.38627507062573673756e-03, 2.42532401381033027481e-05, 2.21941782786353727022e-07
Average error: 0.00000084569900833333333333333333333333
Max error: 0.00000244379044

I use these constants:
9.99999463558197021484e-01, 4.99995589256286621094e-01, 4.16610352694988250732e-02, 1.38627504929900169373e-03, 2.42532332777045667171e-05, 2.21941789391166821588e-07
Average error: 0.00000084443936666666666666666666666667
Max error: 0.00000244379044

Although all of my numbers look different from Garrett's, that is only because his numbers are prior to casting to float. After a cast to float, all of his numbers are the same as mine except the 4th, which I lowered very slightly.

L. Spiro

Edited September 3, 2016 by L. Spiro
TheChubu 9459 Report post Posted September 3, 2016

If that's true, you could probably remove the branch entirely

Or rotate the "i32I & 1" bit right by one, wrapping it around into the sign position, then use that to flip _fX's sign bit accordingly. In any case, no branch is faster than the branch on my end (using doubles).
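The sign-bit trick TheChubu describes can be sketched as follows (my own illustration, with my naming): move the parity bit of i32I up to bit 31 and XOR it into the float's bit pattern, avoiding both the branch and the int-to-float multiply of the earlier suggestions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Branchless sign flip: shift the low (parity) bit of i up into the float
// sign-bit position and XOR it into x's bit pattern. memcpy is the
// strict-aliasing-safe way to move bits between float and uint32_t.
inline float FlipSignIfOdd(float x, int32_t i) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits ^= (uint32_t(i) & 1u) << 31;   // odd i -> toggle the sign bit
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}
```

Whether this beats the branch depends on how well the branch predicts on the target hardware, which matches the mixed results reported in the surrounding posts.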
L. Spiro 25641 Report post Posted September 3, 2016

I've tested the branchless paths some time ago, because branchless should intuitively be the way to go. They didn't give the performance I desired on consoles. I didn't investigate deeply why that was because I was on the clock. It will be one of my upcoming investigations. For now I am focusing on accuracy, and next I will focus on performance.

L. Spiro