Jump to content

  • Log In with Google      Sign In   
  • Create Account

samoth

Member Since 18 Jan 2008
Offline Last Active Today, 02:57 PM

#5312081 Leveling up through mini-quests?

Posted by on Yesterday, 05:19 AM

The problem that I have with these mini-quests is that someone has to generate them, and they will inevitably get boring (there's only so and so many things you can give as quest).

I somewhat like the way stats work in NetHack. Every XYZ you do will "exercise" a stat, and every so-and-so-many turns a check is made whether you have exercised, and there is a small (and diminiushing) chance that the exercised stat will go up.

Now, of course, NetHack wouldn't be NetHack if there wasn't "abusing" a skill, too. Which you can easily do, and there's like 10,000 things you need to keep in mind, or you'll regret.

Basically, remember to carry around enough weight to get stressed (exercises strength) but avoid getting hungry (abuses constitution), and do not carry enough weight to get strained, or even overloaded, since dexterity (and eventually constitution) will be abused.

But yeah, in general, that's a cool concept. Only a tidbit too complex for me, with a little too many pitfalls (I guess that's just the challenge to a hardcore NetHack player, though).


#5310332 fast sin approx algo not faster than sin() ?

Posted by on 11 September 2016 - 05:43 AM

Time to test:
#include "iacaMarks.h"
#include <math.h>
double fsin(double x){
return 1.27323954474 * x + -0.40528473456 * x * fabs(x);
}

int main(int argc, char** argv)
{
    double d = 0.0;
    
    IACA_START
        for(int i = 1; i < 1000; ++i)
            d += fsin(argc);
    IACA_END
    
    return (int)d;
}
(drum roll)
X:\iaca\iaca-win64>gcc -O3 -march=skylake test.cpp -o test.exe && iaca -64 test.exe
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - test.exe
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 3.25 Cycles       Throughput Bottleneck: FrontEnd

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 2.5    0.0  | 2.5  | 1.5    1.5  | 1.5    1.5  | 0.0  | 2.0  | 2.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   0*   |           |     |           |           |     |     |     |     |    | vxorpd xmm1, xmm1, xmm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | mov eax, 0x3e7
|   2    |           | 1.0 |           |           |     | 1.0 |     |     |    | vcvtsi2sd xmm1, xmm1, ebx
|   2^   | 1.0       |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmulsd xmm0, xmm1, qword ptr [rip+0x1305]
|   0*   |           |     |           |           |     |     |     |     |    | vmovapd xmm2, xmm1
|   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 1.0 |     |     |    | vandpd xmm2, xmm2, xmmword ptr [rip+0x1309]
|   1    | 0.9       | 0.1 |           |           |     |     |     |     |    | vmulsd xmm0, xmm0, xmm2
|   2^   | 0.6       | 0.4 | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vfmadd132sd xmm1, xmm0, qword ptr [rip+0x130c]
|   0*   |           |     |           |           |     |     |     |     |    | vxorpd xmm0, xmm0, xmm0
|   0*   |           |     |           |           |     |     |     |     |    | nop dword ptr [rax+rax*1], eax
|   1    |           | 1.0 |           |           |     |     |     |     |    | vaddsd xmm0, xmm0, xmm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | sub eax, 0x1
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffff9
Total Num Of Uops: 12
Well, coming as 100% for free, and being so darn easy to use, I'd say this is a really awesome tool. Only Skylake does not seem to be supported yet, Haswell is the latest arch this version does.


#5310331 fast sin approx algo not faster than sin() ?

Posted by on 11 September 2016 - 05:27 AM

You may be able to get the number of CPU cycles each instruction costs and explain on a machine code level why they are the speed that they are without even running the code.

&amp;nbsp;
This has been nearly impossible to do since at least the time of the Pentium II. The only way to tell how long a piece of code takes is to run it, and even then the exact way in which you run it (cache usage, predictability of branches,&amp;nbsp; whether the result can be computed at compile time, whether the function call can be inlined...) matters a lot.
Well, there is IACA, which does just that. Unluckily, Intel wouldn't be Intel if their website actually worked and you were able to download it.
Gah... I figured it out. Intel... Khaaaaan!

Looking at where the redirection loop gets you (if you paste the link into wget, it spits out the redirections), you land on a site that requires you be logged in with an Intel developer account.

Which would be great if only someone told you! There is a page "accept EULA, download here" and when you click on the link, it just keeps redirecting your browser in an endless loop until the browser gives up.

So... the solution is simple: register an account (use a mailinator address or whatever), log in prior to going to the IACA download page, and it works just fine. :)


#5310121 Could you recommend a lightweight database?

Posted by on 09 September 2016 - 09:17 AM

Postgresql is the best free all-rounder. MySQL is also okay.

Is this funded on concrete experience or even hard evidence? I'm asking because I hear "postgres is better" a lot, but the only real downside in MySQL that I see is that it's not 100% compliant with every little bit of SQL pedantery.

I've used MySQL for decades, and it has never bitten me. Not in an unexpected way anyway... sure, if you use a non-ACID engine, you may lose some... happened maybe 3 or 4 times in my life. But this is something you know in advance. I yet have to see a failure with Inno.

On the other hand, even getting something up and running is (in my opinion) a pain with Postgres, and when I tried, it felt like quite a bit sliggish compared to what I was used to. But maybe that was just because I know nothing of Postgres and did something wrong...

About the actual question, does that huge amount of data have to be saved so often anyway? Items change slots maybe a few dozen times during an entire game session, and they change owner only rarely. Why not write trades to a local journal just to be sure you can recover, and coalesce all item updates once per minute (or less often, even!) to the database.

Similarly, all player stats can easily be held in RAM only, important things logged to a journal just in case, and regularly flushed. If your server crashes so often that this may be an issue, you habe a different problem to look after! And otherwise, if this happens maybe once every other month, no user will be angry if they get restored to full health and full mana upon a crash... as long as they don't lose the Sword of Almighty that they just won a minute earlier (...which the journal prevents from happening to the best of your ability).


#5309820 Faster Sin and Cos

Posted by on 07 September 2016 - 09:38 AM

So you can guess how much I must have been bored during the last hour. I actually moved all those tweakables to the top of the file, and while at it added support for Linux (using clock_gettime with thread cpu time), and for sorting tests by execution time. :lol:
 
If you reduce the number of iterations so it doesn't run against the quota, you can run it nicely on Coliru: http://coliru.stacked-crooked.com/a/9bd33fd63db7edd5

(I am unsure what architecture Coliru builds for, but looking at the mere 6-7 times performance improvement, I guess it's definitively not something recent with SSE/AVX/FMA support)

 
Now, please do not mention "faster sin" to me ever again in my life.




#5309788 Faster Sin and Cos

Posted by on 07 September 2016 - 05:25 AM

Almost everything in the test code is kinda easily configurable.

 

The window for the benchmark and for the max/average/rms error run is set by defaulted parameters. If you want something different, you can just call the test function with whatever you like (note that I'm using a macro to stringify the function name because I'm too lazy to type it twice, so it's probably easiest to just add two params to the macro).

 

I thought that for benchmarking, +/-3.14 was entirely appropriate. Indeed, I was tempted to only do +/- 1 at first because the execution time should not depend on the input for these approximations (it might, but does not, affect the C lib's execution time).

 

For the error calculation, it would probably have been very slightly more correct to do a greater range. Mea culpa, When writing this, I just copy/pasted the function, edited the name, and changed the body accordingly. I think it's still 99.9% good, though.

 

If a value >1 or <-1 is encountered, the function increments a counter, then outputs how many offenders it encountered, and then follows up bracketing all the extrema (I made the assumption that no appromination returns complete bullshit, so if there are values outside the correct range, they must be at the extrema). The bracketing is done by bisection finding, starting with a window 0.1 wide both to the left and to the right of each extremum. This is, for a change, not easily configurable... but 0.1 on each side is quite big, I think it should be enough. An algorithm requiring something larger would mean more than 6% of its ouput values are outside the permitted range! If an algorithm does that, it's complete rubbish.

 

The "simple" sanity tests are done at precise values (well, as precise as I'm getting them from math.h) for +/- PI, +/- PI/2, and 0.

 

The monotony tests are done by selecting random samples within an interval starting at the exact location of an extremum, and ending at the location of an extremum minus the value of eps. It then calculates the value for f(t) and f(t+eps) and checks that, depending on whether the function is supposed to be raising or falling in this interval, the results compare correctly.

 

You can change the value of eps (there's a constexpr with that name inside sanity_impl), in the posted tests, I used 0.001 (tried smaller values as well, no big difference -- algorithms that "pass" with one constant, pass with another too -- and when they "fail", they fail with the other as well... no observable difference on my end, other than number of incidents being different).

A few lines below, in the same function, there is report_threshold (in my tests 3) which you can change if you want more output lines. The test reports up to report_threshold incidents verbosely, giving location, expected, and actual value, and only reports the total number for the rest.

 

Ah heck, I should probably made all those constants global scope and moved them to the top of the file...




#5309782 fast sin approx algo not faster than sin() ?

Posted by on 07 September 2016 - 04:48 AM

The volatile keyword basically prevents any optimization in the loop.

That's right, but volatile is not supposed to be in the loop.

 

It is however very important. It acts as sink. The sin function or any of its approximations is a pure function, that is, its effects only depend on its inputs, and its execution does not influence any state other than its return value. In other words, calling or not calling the function is the same, nobody but the consumer of the return value knows a difference, no externally observable effect..

The compiler will thus, according to the as-If rule, translate this:

int main()
[
    timer_start();
    for(1,000,000 times)
        expensive_pure_function();
    print(timer.stop())
}

to just that:

int main()
{
    timer.start();
    printf(timer.stop());
}

The use of volatile to consume each result prevents it from performing that optimization:

int main()
[
    double sum{};

    timer_start();
    for(1,000,000 times)
        sum += expensive_pure_function();   // no volatile here
    print(timer.stop())
    
    volatile double discard = sum;   // sink
}

Alternatively, you could just print the result (but printing a meaningless number is ugly and confusing), or either cast the accumulated result to int and return it from main, or you could omit initializing the accumulating variable. Returning the value from main is somewhat clumsy due to the cast, and will give a narrowing or "loses precision" warning, and the other solution uses an indeterminate value. Which, although in practice it most of the time "works" (the compiler cannot determine what the result will be, so it has no choice but to execute everything) is technically undefined behavior. The thing with undefined behavior is, you just never know for sure. The compiler might just as well show you the middle finger and optimize out the entire main function because of UB.

 

With volatile, you are telling the compiler, in an easy and well-defined way, that it is required to set the value of discard at the end (even though it's not otherwise used). It is not allowed to reorder that move, or to omit it alltogether.

In order to do that, the compiler must necessarily know the value and therefore cannot optimize out the calculation. It could, admittedly, run the complete set of calculations at compiletime, and for iteraton counts in the range below 500 or so, this indeed happens. But for iteration counts in the millions, no chance.




#5309700 fast sin approx algo not faster than sin() ?

Posted by on 06 September 2016 - 11:15 AM

Unless every single CPU cycle counts, I recommend you use Spiro's version (text-replace float with double). If that's too slow, use Garrett's.

The reason is that while your version is about 0.5 nanoseconds faster than Garrett (and nearly twice as fast as Spiro-C11 in double) on the average, over 500 million samples (which, frankly, means nothing at all!), it is abysmal when it comes to max error, average error, and root of mean squared error. It also fails every single one of the sanity tests on every value. In other words, it loses on every metric.

That may not matter at all, it can still be a perfectly usable approximation for your case. But there may be cases (which you cannot even foresee now) where it might just matter, too. And using something that is vastly better doesn't really cost you anything. Same timing, more or less, and code readily available, just copy and paste.


#5309630 fast sin approx algo not faster than sin() ?

Posted by on 06 September 2016 - 03:36 AM

That's what I guess yeah. But we still find "modern" (2012) posts that talks abouts faster sin functions ( http://www.gamedev.net/topic/621589-extremely-fast-sin-approximation/ )

 

Dunno if hardware trigonomety functions have been implemented only very recently?

There is a current thread on this: http://www.gamedev.net/topic/681723-faster-sin-and-cos/

 

Bottom line: a pretty good custom implementation (way better than yours, and about twice as much work) can be easily 10 times, if not 20 times faster than the standard library.

 

Hardware trig functions have been around since... forever (I think 386 had it already). But they have latencies in the 200 cycles range (and I think you can only start one every 200 or so cycles, too). Even traditional 387 math implementations are usually faster (and approximations anyway).

 

On modern hardware, a couple of fused-multiply-add instructions (which is what your code boils down to) have a latency of maybe a dozen or so cycles at most, and depending on how well the actual code allows pipelining to kick in, it might effectively cost you 1-2 cycles in the best case.

 

 

For the timer i use the the high resolution timer available on the platform

This is mighty fine for anthing like at least 10 million or so iterations. QPC has a resolution of about 0.3 µs on my system (which is also an i7).

Thus, you can say that anything that takes at least something-millisecond will be fine to measure with good precision, anything that is something-micro (like ten thousand thousand to a million iterations) is kinda acceptable, but precision will not be stellar, and anything that is something-nano (such as only running e.g. 5-10 iterations) is a pure bullshit measure.

 

You can time single iterations (or few) using the CPU's TSC counter, but there are three important gotchas. First, TSC may not be in sync between different cores. That's said to be mostly a "historic problem" and no longer the case, but I just had "negative time" on my current system trying this last week. Second, what you measure is processor cycles, not time. That's actually an advantage because you are no longer subject to different clock rates. If your algorithm takes 15 CPU cycles, it will take 15 cycles at 800MHz and it will still take 15 cycles at 4,000MHz (only now cycles are faster!). It's someting to be aware of, however... some people use TSC for measuring time, and that's bound to fail. Lastly, to give a precise measurement, you must completely empty the processor pipeline before taking the TSC. In other words, rather than RDTSC, you must execute CPUID followed by RDTSC (CPUID being the only synchronizing instruction that you can call in a user process).
 




#5309506 Faster Sin and Cos

Posted by on 05 September 2016 - 05:52 AM

Here is a test with gcc-4.8.1 (32 bit) targetting nocona. Only reliably working 32-bit compiler that I have on this machine. In order to compile, I had to text-replace the thousands-delimiter (') and constexpr since gcc-4.8.1 does not support C++14 yet (-O3 enables -ffast-math, so it's not explicitly given)

Holy ffffffffuuuuuuuuuck, is this slow! Same machine, same numer of iterations, and twice as much time. Relatively, the C lib funcition becomes "faster".

Overall SSE math is somewhat slower for all implementations than 387, although it does help Spiro-double (it's devastating to Spiro-single, however... almost 4x slower for C15 and 3x for C11).

About precision: Spiro wins both hands down on absolute and average error with 387, loses on RMSE. With SSE-math, Spiro wins big time on max error but loses otherwise, also the C15 has sin(x)>1 issues (C11 does not).

Results for gcc --std=gnu++11 -O3 -march=nocona

QPC resolution      = 3328398 ticks/sec
Iterations per test = 500000000


timings
sin               : 83964145      [0,050453 us per iteration] --> 1,0 : 1
sin_garrett_c11   : 5697282       [0,003423 us per iteration] --> 14,7 : 1
sin_garrett_c11_s : 6170228       [0,003708 us per iteration] --> 13,6 : 1
sin_spiro_s       : 13514811      [0,008121 us per iteration] --> 6,2 : 1
sin_spiro         : 13132087      [0,007891 us per iteration] --> 6,4 : 1
sin_adam42_s      : 16948266      [0,010184 us per iteration] --> 5,0 : 1
sin_adam42        : 15594794      [0,009371 us per iteration] --> 5,4 : 1
sin_spiro_c11_s   : 9764995       [0,005868 us per iteration] --> 8,6 : 1
sin_spiro_c11     : 10934390      [0,006570 us per iteration] --> 7,7 : 1

error metrics
sin               : emax=0,000000000000000 eavg=0,000000000000000 sse=0,000000000000000 rmse=0,000000000000000   0 values > 1.0
sin_garrett_c11   : emax=0,000000291691941 eavg=0,000000051244667 sse=0,000001752641653 rmse=0,000000059205433   0 values > 1.0
sin_garrett_c11_s : emax=0,000000410737697 eavg=0,000000056510521 sse=0,000002462154325 rmse=0,000000070173418   0 values > 1.0
sin_spiro_s       : emax=0,000000126618037 eavg=0,000000024659305 sse=0,000000666347036 rmse=0,000000036506083   0 values > 1.0
sin_spiro         : emax=0,000000019798559 eavg=0,000000008708419 sse=0,000000063015472 rmse=0,000000011226350   0 values > 1.0
sin_adam42_s      : emax=0,000000126617319 eavg=0,000000024659305 sse=0,000000666347036 rmse=0,000000036506083   0 values > 1.0
sin_adam42        : emax=0,000000019798559 eavg=0,000000008708419 sse=0,000000063015472 rmse=0,000000011226350   0 values > 1.0
sin_spiro_c11_s   : emax=0,000000279356114 eavg=0,000000065894434 sse=0,000003434928817 rmse=0,000000082884604   0 values > 1.0
sin_spiro_c11     : emax=0,000000167929271 eavg=0,000000061079306 sse=0,000002831597331 rmse=0,000000075254200   0 values > 1.0

sanity tests
sin technically failed the sin(-pi) == 0 test   (but has more than 10 good digits)
    expect : +0,0000000000000000
    value  : -0,0000000000000001
    diff   : +0,0000000000000001
sin technically failed the sin(pi) == 0 test   (but has more than 10 good digits)
    expect : +0,0000000000000000
    value  : +0,0000000000000001
    diff   : +0,0000000000000001
sin_garrett_c11  failed the sin(-pi) == 0 test   
    expect : +0,0000000000000000
    value  : +0,0000003055730729
    diff   : +0,0000003055730729
sin_garrett_c11  failed the sin(-pi/2) == -1 test   
    expect : -1,0000000000000000
    value  : -0,9999999478418549
    diff   : +0,0000000521581451
sin_garrett_c11  failed the sin(pi/2) == 1 test   
    expect : +1,0000000000000000
    value  : +0,9999999478418549
    diff   : +0,0000000521581451
sin_garrett_c11  failed the sin(pi) == 0 test   
    expect : +0,0000000000000000
    value  : -0,0000003055730729
    diff   : +0,0000003055730729
sin_garrett_c11 failed the monotonically rising in [-2 pi, -3/2 pi] test
    at t     = -5,8006437385161131
    f(t)     = 0,8191135033185942
    f(t+eps) = 0,8190794596769832
sin_garrett_c11 failed the monotonically rising in [-2 pi, -3/2 pi] test
    at t     = -5,8201378991877419
    f(t)     = 0,8201224516653635
    f(t+eps) = 0,8200543801193480
sin_garrett_c11 failed the monotonically rising in [-2 pi, -3/2 pi] test
    at t     = -6,0921383176898001
    f(t)     = 0,9266029064291210
    f(t+eps) = 0,9258043767902342
---> 1780385 incidents total (5000000 samples)
sin_garrett_c11 failed the monotonically rising in [3/2 pi, 2 pi] test
    at t     = 5,9531260762097542
    f(t)     = -0,8475519144872781
    f(t+eps) = -0,8479151826540377
sin_garrett_c11 failed the monotonically rising in [3/2 pi, 2 pi] test
    at t     = 6,1518134216550120
    f(t)     = -0,9812926066411312
    f(t+eps) = -0,9823326492861187
sin_garrett_c11 failed the monotonically rising in [3/2 pi, 2 pi] test
    at t     = 6,2285178390787088
    f(t)     = -1,0741594961994332
    f(t+eps) = -1,0755554115382013
---> 1781416 incidents total (5000000 samples)
sin_garrett_c11_s  failed the sin(-pi) == 0 test   
    expect : +0,0000000000000000
    value  : +0,0000003929966113
    diff   : +0,0000003929966113
sin_garrett_c11_s  failed the sin(-pi/2) == -1 test   
    expect : -1,0000000000000000
    value  : -0,9999999403953552
    diff   : +0,0000000596046448
sin_garrett_c11_s  failed the sin(pi/2) == 1 test   
    expect : +1,0000000000000000
    value  : +0,9999999403953552
    diff   : +0,0000000596046448
sin_garrett_c11_s  failed the sin(pi) == 0 test   
    expect : +0,0000000000000000
    value  : -0,0000003929966113
    diff   : +0,0000003929966113
sin_garrett_c11_s failed the monotonically rising in [-2 pi, -3/2 pi] test
    at t     = -5,8006437385161131
    f(t)     = 0,8191134929656982
    f(t+eps) = 0,8190794587135315
sin_garrett_c11_s failed the monotonically rising in [-2 pi, -3/2 pi] test
    at t     = -5,8201378991877419
    f(t)     = 0,8201224803924561
    f(t+eps) = 0,8200544118881226
sin_garrett_c11_s failed the monotonically rising in [-2 pi, -3/2 pi] test
    at t     = -6,0921383176898001
    f(t)     = 0,9266029000282288
    f(t+eps) = 0,9258044362068176
---> 1780673 incidents total (5000000 samples)
sin_garrett_c11_s failed the monotonically rising in [3/2 pi, 2 pi] test
    at t     = 5,9531260762097542
    f(t)     = -0,8475518822669983
    f(t+eps) = -0,8479151129722595
sin_garrett_c11_s failed the monotonically rising in [3/2 pi, 2 pi] test
    at t     = 6,1518134216550120
    f(t)     = -0,9812927246093750
    f(t+eps) = -0,9823326468467712
sin_garrett_c11_s failed the monotonically rising in [3/2 pi, 2 pi] test
    at t     = 6,2285178390787088
    f(t)     = -1,0741597414016724
    f(t+eps) = -1,0755555629730225
---> 1781717 incidents total (5000000 samples)
sin_spiro_s  failed the sin(-pi) == 0 test   
    expect : +0,0000000000000000
    value  : +0,0000000874227766
    diff   : +0,0000000874227766
sin_spiro_s  failed the sin(-pi/2) == -1 test   
    expect : -1,0000000000000000
    value  : -0,9999999868726522
    diff   : +0,0000000131273478
sin_spiro_s  failed the sin(pi/2) == 1 test   
    expect : +1,0000000000000000
    value  : +0,9999999868726522
    diff   : +0,0000000131273478
sin_spiro_s  failed the sin(pi) == 0 test   
    expect : +0,0000000000000000
    value  : -0,0000000874227766
    diff   : +0,0000000874227766
sin_spiro  failed the sin(-pi/2) == -1 test   
    expect : -1,0000000000000000
    value  : -0,9999999868726539
    diff   : +0,0000000131273461
sin_spiro  failed the sin(pi/2) == 1 test   
    expect : +1,0000000000000000
    value  : +0,9999999868726539
    diff   : +0,0000000131273461
sin_adam42_s  failed the sin(-pi) == 0 test   
    expect : +0,0000000000000000
    value  : +0,0000000874227766
    diff   : +0,0000000874227766
sin_adam42_s  failed the sin(-pi/2) == -1 test   
    expect : -1,0000000000000000
    value  : -0,9999999868726522
    diff   : +0,0000000131273478
sin_adam42_s  failed the sin(pi/2) == 1 test   
    expect : +1,0000000000000000
    value  : +0,9999999868726522
    diff   : +0,0000000131273478
sin_adam42_s  failed the sin(pi) == 0 test   
    expect : +0,0000000000000000
    value  : -0,0000000874227766
    diff   : +0,0000000874227766
sin_adam42  failed the sin(-pi/2) == -1 test   
    expect : -1,0000000000000000
    value  : -0,9999999868726539
    diff   : +0,0000000131273461
sin_adam42  failed the sin(pi/2) == 1 test   
    expect : +1,0000000000000000
    value  : +0,9999999868726539
    diff   : +0,0000000131273461
sin_spiro_c11_s  failed the sin(-pi) == 0 test   
    expect : +0,0000000000000000
    value  : +0,0000002572315533
    diff   : +0,0000002572315533
sin_spiro_c11_s  failed the sin(-pi/2) == -1 test   
    expect : -1,0000000000000000
    value  : -0,9999999175527640
    diff   : +0,0000000824472361
sin_spiro_c11_s  failed the sin(pi/2) == 1 test   
    expect : +1,0000000000000000
    value  : +0,9999999175527640
    diff   : +0,0000000824472361
sin_spiro_c11_s  failed the sin(pi) == 0 test   
    expect : +0,0000000000000000
    value  : -0,0000002572315533
    diff   : +0,0000002572315533
sin_spiro_c11  failed the sin(-pi) == 0 test   
    expect : +0,0000000000000000
    value  : +0,0000001698080954
    diff   : +0,0000001698080954
sin_spiro_c11  failed the sin(-pi/2) == -1 test   
    expect : -1,0000000000000000
    value  : -0,9999999175527783
    diff   : +0,0000000824472218
sin_spiro_c11  failed the sin(pi/2) == 1 test   
    expect : +1,0000000000000000
    value  : +0,9999999175527783
    diff   : +0,0000000824472218
sin_spiro_c11  failed the sin(pi) == 0 test   
    expect : +0,0000000000000000
    value  : -0,0000001698080954
    diff   : +0,0000001698080954

Results for gcc --std=gnu++11 -O3 -march=nocona -msse -msse2 -msse3 -fpmath=sse

QPC resolution      = 3328398 ticks/sec
Iterations per test = 500000000


timings
sin               : 84243075      [0,050621 us per iteration] --> 1,0 : 1
sin_garrett_c11   : 6152364       [0,003697 us per iteration] --> 13,7 : 1
sin_garrett_c11_s : 9737130       [0,005851 us per iteration] --> 8,7 : 1
sin_spiro_s       : 49731322      [0,029883 us per iteration] --> 1,7 : 1
sin_spiro         : 12165871      [0,007310 us per iteration] --> 6,9 : 1
sin_adam42_s      : 27206552      [0,016348 us per iteration] --> 3,1 : 1
sin_adam42        : 13240343      [0,007956 us per iteration] --> 6,4 : 1
sin_spiro_c11_s   : 14041474      [0,008437 us per iteration] --> 6,0 : 1
sin_spiro_c11     : 9505062       [0,005711 us per iteration] --> 8,9 : 1

error metrics
sin               : emax=0,000000000000000 eavg=0,000000000000000 sse=0,000000000000000 rmse=0,000000000000000   0 values > 1.0
sin_garrett_c11   : emax=0,000000291691941 eavg=0,000000051244667 sse=0,000001752641653 rmse=0,000000059205433   0 values > 1.0
sin_garrett_c11_s : emax=0,000000485314278 eavg=0,000000058533900 sse=0,000002706517939 rmse=0,000000073573337   0 values > 1.0
sin_spiro_s       : emax=0,000000503048245 eavg=0,000000045625274 sse=0,000002303864921 rmse=0,000000067880261   1937 values > 1.0
    bracket around -3/2 pi where f(x) >  1 : [-4,712584292884690 ; -4,612388983364922]    interval = 0,100195309519767
    bracket around -pi/2   where f(x) < -1 : [-1,570942811169897 ; -1,470796329775129]    interval = 0,100146481394768
    bracket around  pi/2   where f(x) >  1 : [+1,570601014294897 ; +1,670796323814664]    interval = 0,100195309519768
    bracket around  3/2 pi where f(x) < -1 : [+4,712291324134689 ; +4,812388977404457]    interval = 0,100097653269768
sin_spiro         : emax=0,000000019798559 eavg=0,000000008708419 sse=0,000000063015472 rmse=0,000000011226350   0 values > 1.0
sin_adam42_s      : emax=0,000000627044312 eavg=0,000000053344115 sse=0,000003131099316 rmse=0,000000079134055   11181 values > 1.0
    bracket around -3/2 pi where f(x) >  1 : [-4,712584292884690 ; -4,711998355384690]    interval = 0,000585937499999
    bracket around -pi/2   where f(x) < -1 : [-1,570869568982396 ; -1,570405701794897]    interval = 0,000463867187500
    bracket around  pi/2   where f(x) >  1 : [+1,570405701794897 ; +1,570799378552709]    interval = 0,000393676757813
    bracket around  3/2 pi where f(x) < -1 : [+4,711998355384690 ; +4,712584292884690]    interval = 0,000585937499999
sin_adam42        : emax=0,000000019798559 eavg=0,000000008708419 sse=0,000000063015472 rmse=0,000000011226350   0 values > 1.0
sin_spiro_c11_s   : emax=0,000000663675650 eavg=0,000000076164470 sse=0,000005072194449 rmse=0,000000100719357   0 values > 1.0
sin_spiro_c11     : emax=0,000000167929271 eavg=0,000000061079306 sse=0,000002831597331 rmse=0,000000075254200   0 values > 1.0

sanity tests
sin technically failed the sin(-pi) == 0 test   (but has more than 10 good digits)
    expect : +0,0000000000000000
    value  : -0,0000000000000001
    diff   : +0,0000000000000001
sin technically failed the sin(pi) == 0 test   (but has more than 10 good digits)
    expect : +0,0000000000000000
    value  : +0,0000000000000001
    diff   : +0,0000000000000001
sin_garrett_c11  failed the sin(-pi) == 0 test   
    expect : +0,0000000000000000
    value  : +0,0000003055730729
    diff   : +0,0000003055730729
sin_garrett_c11  failed the sin(-pi/2) == -1 test   
    expect : -1,0000000000000000
    value  : -0,9999999478418549
    diff   : +0,0000000521581451
sin_garrett_c11  failed the sin(pi/2) == 1 test   
    expect : +1,0000000000000000
    value  : +0,9999999478418549
    diff   : +0,0000000521581451
sin_garrett_c11  failed the sin(pi) == 0 test   
    expect : +0,0000000000000000
    value  : -0,0000003055730729
    diff   : +0,0000003055730729
sin_garrett_c11 failed the monotonically rising in [-2 pi, -3/2 pi] test
    at t     = -5,8006437385161131
    f(t)     = 0,8191135033185978
    f(t+eps) = 0,8190794596769827
sin_garrett_c11 failed the monotonically rising in [-2 pi, -3/2 pi] test
    at t     = -5,8201378991877419
    f(t)     = 0,8201224516653600
    f(t+eps) = 0,8200543801193539
sin_garrett_c11 failed the monotonically rising in [-2 pi, -3/2 pi] test
    at t     = -6,0921383176898001
    f(t)     = 0,9266029064291228
    f(t+eps) = 0,9258043767902408
---> 1780385 incidents total (5000000 samples)
sin_garrett_c11 failed the monotonically rising in [3/2 pi, 2 pi] test
    at t     = 5,9531260762097542
    f(t)     = -0,8475519144872734
    f(t+eps) = -0,8479151826540299
sin_garrett_c11 failed the monotonically rising in [3/2 pi, 2 pi] test
    at t     = 6,1518134216550120
    f(t)     = -0,9812926066411279
    f(t+eps) = -0,9823326492861131
sin_garrett_c11 failed the monotonically rising in [3/2 pi, 2 pi] test
    at t     = 6,2285178390787088
    f(t)     = -1,0741594961994272
    f(t+eps) = -1,0755554115382078
---> 1781416 incidents total (5000000 samples)
sin_garrett_c11_s  failed the sin(-pi) == 0 test   
    expect : +0,0000000000000000
    value  : +0,0000004111419116
    diff   : +0,0000004111419116
sin_garrett_c11_s  failed the sin(-pi/2) == -1 test   
    expect : -1,0000000000000000
    value  : -0,9999999403953552
    diff   : +0,0000000596046448
sin_garrett_c11_s  failed the sin(pi/2) == 1 test   
    expect : +1,0000000000000000
    value  : +0,9999999403953552
    diff   : +0,0000000596046448
sin_garrett_c11_s  failed the sin(pi) == 0 test   
    expect : +0,0000000000000000
    value  : -0,0000004111419116
    diff   : +0,0000004111419116
sin_garrett_c11_s failed the monotonically rising in [-2 pi, -3/2 pi] test
    at t     = -5,8006437385161131
    f(t)     = 0,8191134929656982
    f(t+eps) = 0,8190794587135315
sin_garrett_c11_s failed the monotonically rising in [-2 pi, -3/2 pi] test
    at t     = -5,8201378991877419
    f(t)     = 0,8201224803924561
    f(t+eps) = 0,8200543522834778
sin_garrett_c11_s failed the monotonically rising in [-2 pi, -3/2 pi] test
    at t     = -6,0921383176898001
    f(t)     = 0,9266027808189392
    f(t+eps) = 0,9258044362068176
---> 1780704 incidents total (5000000 samples)
sin_garrett_c11_s failed the monotonically rising in [3/2 pi, 2 pi] test
    at t     = 5,9531260762097542
    f(t)     = -0,8475518822669983
    f(t+eps) = -0,8479151129722595
sin_garrett_c11_s failed the monotonically rising in [3/2 pi, 2 pi] test
    at t     = 6,1518134216550120
    f(t)     = -0,9812927842140198
    f(t+eps) = -0,9823326468467712
sin_garrett_c11_s failed the monotonically rising in [3/2 pi, 2 pi] test
    at t     = 6,2285178390787088
    f(t)     = -1,0741598606109619
    f(t+eps) = -1,0755554437637329
---> 1781735 incidents total (5000000 samples)
sin_spiro_s  failed the sin(-pi) == 0 test   
    expect : +0,0000000000000000
    value  : +0,0000000874227766
    diff   : +0,0000000874227766
sin_spiro_s  failed the sin(-pi/2) == -1 test   
    expect : -1,0000000000000000
    value  : -0,9999998807907105
    diff   : +0,0000001192092896
sin_spiro_s  failed the sin(pi/2) == 1 test   
    expect : +1,0000000000000000
    value  : +0,9999998807907105
    diff   : +0,0000001192092896
sin_spiro_s  failed the sin(pi) == 0 test   
    expect : +0,0000000000000000
    value  : -0,0000000874227766
    diff   : +0,0000000874227766
sin_spiro  failed the sin(-pi/2) == -1 test   
    expect : -1,0000000000000000
    value  : -0,9999999868726539
    diff   : +0,0000000131273461
sin_spiro  failed the sin(pi/2) == 1 test   
    expect : +1,0000000000000000
    value  : +0,9999999868726539
    diff   : +0,0000000131273461
sin_adam42_s  failed the sin(-pi) == 0 test   
    expect : +0,0000000000000000
    value  : +0,0000000874227766
    diff   : +0,0000000874227766
sin_adam42_s  failed the sin(pi) == 0 test   
    expect : +0,0000000000000000
    value  : -0,0000000874227766
    diff   : +0,0000000874227766
sin_adam42  failed the sin(-pi/2) == -1 test   
    expect : -1,0000000000000000
    value  : -0,9999999868726539
    diff   : +0,0000000131273461
sin_adam42  failed the sin(pi/2) == 1 test   
    expect : +1,0000000000000000
    value  : +0,9999999868726539
    diff   : +0,0000000131273461
sin_spiro_c11_s  failed the sin(-pi/2) == -1 test   
    expect : -1,0000000000000000
    value  : -0,9999998807907105
    diff   : +0,0000001192092896
sin_spiro_c11_s  failed the sin(pi/2) == 1 test   
    expect : +1,0000000000000000
    value  : +0,9999998807907105
    diff   : +0,0000001192092896
sin_spiro_c11  failed the sin(-pi) == 0 test   
    expect : +0,0000000000000000
    value  : +0,0000001698080958
    diff   : +0,0000001698080958
sin_spiro_c11  failed the sin(-pi/2) == -1 test   
    expect : -1,0000000000000000
    value  : -0,9999999175527783
    diff   : +0,0000000824472217
sin_spiro_c11  failed the sin(pi/2) == 1 test   
    expect : +1,0000000000000000
    value  : +0,9999999175527783
    diff   : +0,0000000824472217
sin_spiro_c11  failed the sin(pi) == 0 test   
    expect : +0,0000000000000000
    value  : -0,0000001698080958
    diff   : +0,0000001698080958



#5309499 Faster Sin and Cos

Posted by on 05 September 2016 - 04:40 AM

&amp;nbsp;

But your link is broken.

Works for me (just tried). You may have to enable Javascript, for some obscure reason the godbolt site doesn't work without JS. Uploaded it to ideone meanwhile, in case you don't get godbolt to cooperate: https://ideone.com/plain/vH7qr9

If that doesn't work either, I can post it here, but I'm a bit reluctant spamming the forums with a 15kB file if there's an easy way around :)

My tests have always shown...

Are you maybe compiling in 32-bit mode, or for an older architecture like PIII? This might be a possible reason. I'm compiling in 64-bits -O3 --march=skylake. This processor has FMA available, and the compiler is using it (albeit for all funcitons, not just some). Quite possibly, however, a not-FMA build might have slightly different results, both in performance and in precision.

It may be interesting to look at that.


#5309417 Faster Sin and Cos

Posted by on 04 September 2016 - 01:40 PM

Fixed those tests (had one sign wrong), looks more plausible now, and added a few more bells and whistles.

Which one is the best function depends greatly on what you need.

In summary, Spiro's original (C15) version is vastly more accurate than Garrett's C11 looking at maximal error. His C11 version is competitive in double mode, but not so in single mode. Spiro's C15 beats Garrett's C11 on average error, but his C11 loses out. Looking at RMSE, Spiro's C11 version is inferior (the C15 version is still better than Garrett's C11, but that is unsurprising).

Garrett's C11 function is undisputedly the fastest of all competitors. It does however have an issue about which I'm not quite how severe it is (read on!).

All competitors, no exceptions, failed (or at least came close to failing) one or the other sanity test. The C standard library does not report sin(0) as exactly 0.0 (although it is close to one digit, but really... a standard library function should be 100% exact on the few "well known" locations). All of the other implementations are somewhat worse, but still within at least 6-7 decimals, so I guess that's OK. We're talking about reasonably good fast approximations, not exact science, after all!

Spiro's C15 and Adam_42's reordered version both have intervals where f(x) > +/-1. This may be an annoying property, but I think it's not too bad.

However, the thing that makes me worry the most is that Garrett's version seems to be not monotonically rising/falling when it should. The code tests random positions within given intervals and compares e.g. f(t) < f(t+x) where x is a small constant. It happens that the comparison isn't always true. In other words, the curve must be somewhat "wiggly". Now, I am not sure if this couldn't cause problems if one relies on monotony. But maybe that's a complete non-issue... not sure. It could for example matter if you did many small incremental rotations. But nobody is doing that for obvious reasons... so... not sure?

I've uploaded the test program to godbolt.orgfor whoever might want to play further with it (but note you can't run it there, it takes about 4 mins per run).

Complete output from one test run on my desktop:
 

QPC resolution      = 3,328,212 ticks/sec
Iterations per test = 500,000,000


timings
sin               : 60,082,423    [0,036105 us per iteration] --> 1,0 : 1
sin_garrett_c11   : 2,856,948     [0,001717 us per iteration] --> 21,0 : 1
sin_garrett_c11_s : 4,166,675     [0,002504 us per iteration] --> 14,4 : 1
sin_spiro_s       : 7,969,460     [0,004789 us per iteration] --> 7,5 : 1
sin_spiro         : 4,588,002     [0,002757 us per iteration] --> 13,1 : 1
sin_adam42_s      : 12,541,737    [0,007537 us per iteration] --> 4,8 : 1
sin_adam42        : 7,973,903     [0,004792 us per iteration] --> 7,5 : 1
sin_spiro_c11_s   : 4,988,171     [0,002998 us per iteration] --> 12,0 : 1
sin_spiro_c11     : 3,852,504     [0,002315 us per iteration] --> 15,6 : 1

error metrics
sin               : emax=0,000000000000000 eavg=0,000000000000000 sse=0,000000000000000 rmse=0,000000000000000   0 values > 1.0
sin_garrett_c11   : emax=0,000000291691941 eavg=0,000000051244667 sse=0,000001752641653 rmse=0,000000059205433   0 values > 1.0
sin_garrett_c11_s : emax=0,000000485314278 eavg=0,000000058533900 sse=0,000002706517939 rmse=0,000000073573337   0 values > 1.0
sin_spiro_s       : emax=0,000000395600644 eavg=0,000000040108091 sse=0,000001704967828 rmse=0,000000058394654   380 values > 1.0
    bracket around -3/2 pi where f(x) >  1 : [-4,712413394447189 ; -4,712291324134689]    interval = 0,000122070312500
    bracket around -pi/2   where f(x) < -1 : [-1,570820740857397 ; -1,570601014294897]    interval = 0,000219726562500
    bracket around  pi/2   where f(x) >  1 : [+1,570601014294897 ; +1,570820740857397]    interval = 0,000219726562500
    bracket around  3/2 pi where f(x) < -1 : [+4,712291324134689 ; +4,712413394447189]    interval = 0,000122070312500
sin_spiro         : emax=0,000000019798559 eavg=0,000000008708419 sse=0,000000063015472 rmse=0,000000011226350   0 values > 1.0
sin_adam42_s      : emax=0,000000626152196 eavg=0,000000044799408 sse=0,000002159329615 rmse=0,000000065716507   2033 values > 1.0
    bracket around -3/2 pi where f(x) >  1 : [-4,712584292884689 ; -4,712340152259689]    interval = 0,000244140625000
    bracket around -pi/2   where f(x) < -1 : [-1,570845154919897 ; -1,570601014294897]    interval = 0,000244140625000
    bracket around  pi/2   where f(x) >  1 : [+1,570601014294897 ; +1,570845154919897]    interval = 0,000244140625000
    bracket around  3/2 pi where f(x) < -1 : [+4,712340152259689 ; +4,712584292884689]    interval = 0,000244140625000
sin_adam42        : emax=0,000000019798559 eavg=0,000000008708419 sse=0,000000063015472 rmse=0,000000011226350   0 values > 1.0
sin_spiro_c11_s   : emax=0,000000540252794 eavg=0,000000072925062 sse=0,000004476188423 rmse=0,000000094617001   0 values > 1.0
sin_spiro_c11     : emax=0,000000167929271 eavg=0,000000061079306 sse=0,000002831597331 rmse=0,000000075254200   0 values > 1.0

sanity tests
sin technically failed the 'sin(-pi) == 0' test   (but has more than 10 good digits)
    expect : +0,0000000000000000
    value  : -0,0000000000000001
    diff   : +0,0000000000000001
sin technically failed the 'sin(pi) == 0' test   (but has more than 10 good digits)
    expect : +0,0000000000000000
    value  : +0,0000000000000001
    diff   : +0,0000000000000001
sin_garrett_c11  failed the 'sin(-pi) == 0' test   
    expect : +0,0000000000000000
    value  : +0,0000003055730728
    diff   : +0,0000003055730728
sin_garrett_c11  failed the 'sin(-pi/2) == -1' test   
    expect : -1,0000000000000000
    value  : -0,9999999478418549
    diff   : +0,0000000521581451
sin_garrett_c11  failed the 'sin(pi/2) == 1' test   
    expect : +1,0000000000000000
    value  : +0,9999999478418549
    diff   : +0,0000000521581451
sin_garrett_c11  failed the 'sin(pi) == 0' test   
    expect : +0,0000000000000000
    value  : -0,0000003055730728
    diff   : +0,0000003055730728
sin_garrett_c11 failed the 'monotonically rising in [-2 pi, -3/2 pi]' test
    at t     = -5,8006437385161131
    f(t)     = 0,8191135033185917
    f(t+eps) = 0,8190794596769879
sin_garrett_c11 failed the 'monotonically rising in [-2 pi, -3/2 pi]' test
    at t     = -5,8201378991877419
    f(t)     = 0,8201224516653609
    f(t+eps) = 0,8200543801193481
sin_garrett_c11 failed the 'monotonically rising in [-2 pi, -3/2 pi]' test
    at t     = -6,0921383176898001
    f(t)     = 0,9266029064291225
    f(t+eps) = 0,9258043767902340
---> 1,780,385 incidents total (5,000,000 samples)
sin_garrett_c11 failed the 'monotonically rising in [3/2 pi, 2 pi]' test
    at t     = 5,9531260762097542
    f(t)     = -0,8475519144872790
    f(t+eps) = -0,8479151826540420
sin_garrett_c11 failed the 'monotonically rising in [3/2 pi, 2 pi]' test
    at t     = 6,1518134216550120
    f(t)     = -0,9812926066411358
    f(t+eps) = -0,9823326492861200
sin_garrett_c11 failed the 'monotonically rising in [3/2 pi, 2 pi]' test
    at t     = 6,2285178390787088
    f(t)     = -1,0741594961994370
    f(t+eps) = -1,0755554115381964
---> 1,781,416 incidents total (5,000,000 samples)
sin_garrett_c11_s  failed the 'sin(-pi) == 0' test   
    expect : +0,0000000000000000
    value  : +0,0000004111419116
    diff   : +0,0000004111419116
sin_garrett_c11_s  failed the 'sin(-pi/2) == -1' test   
    expect : -1,0000000000000000
    value  : -0,9999999403953552
    diff   : +0,0000000596046448
sin_garrett_c11_s  failed the 'sin(pi/2) == 1' test   
    expect : +1,0000000000000000
    value  : +0,9999999403953552
    diff   : +0,0000000596046448
sin_garrett_c11_s  failed the 'sin(pi) == 0' test   
    expect : +0,0000000000000000
    value  : -0,0000004111419116
    diff   : +0,0000004111419116
sin_garrett_c11_s failed the 'monotonically rising in [-2 pi, -3/2 pi]' test
    at t     = -5,8006437385161131
    f(t)     = 0,8191134929656982
    f(t+eps) = 0,8190794587135315
sin_garrett_c11_s failed the 'monotonically rising in [-2 pi, -3/2 pi]' test
    at t     = -5,8201378991877419
    f(t)     = 0,8201224803924561
    f(t+eps) = 0,8200543522834778
sin_garrett_c11_s failed the 'monotonically rising in [-2 pi, -3/2 pi]' test
    at t     = -6,0921383176898001
    f(t)     = 0,9266027808189392
    f(t+eps) = 0,9258044362068176
---> 1,780,704 incidents total (5,000,000 samples)
sin_garrett_c11_s failed the 'monotonically rising in [3/2 pi, 2 pi]' test
    at t     = 5,9531260762097542
    f(t)     = -0,8475518822669983
    f(t+eps) = -0,8479151129722595
sin_garrett_c11_s failed the 'monotonically rising in [3/2 pi, 2 pi]' test
    at t     = 6,1518134216550120
    f(t)     = -0,9812927842140198
    f(t+eps) = -0,9823326468467712
sin_garrett_c11_s failed the 'monotonically rising in [3/2 pi, 2 pi]' test
    at t     = 6,2285178390787088
    f(t)     = -1,0741598606109619
    f(t+eps) = -1,0755554437637329
---> 1,781,735 incidents total (5,000,000 samples)
sin_spiro_s  failed the 'sin(-pi) == 0' test   
    expect : +0,0000000000000000
    value  : +0,0000000874227766
    diff   : +0,0000000874227766
sin_spiro_s  failed the 'sin(pi) == 0' test   
    expect : +0,0000000000000000
    value  : -0,0000000874227766
    diff   : +0,0000000874227766
sin_spiro  failed the 'sin(-pi/2) == -1' test   
    expect : -1,0000000000000000
    value  : -0,9999999868726539
    diff   : +0,0000000131273461
sin_spiro  failed the 'sin(pi/2) == 1' test   
    expect : +1,0000000000000000
    value  : +0,9999999868726539
    diff   : +0,0000000131273461
sin_adam42_s  failed the 'sin(-pi) == 0' test   
    expect : +0,0000000000000000
    value  : +0,0000000874227766
    diff   : +0,0000000874227766
sin_adam42_s  failed the 'sin(pi) == 0' test   
    expect : +0,0000000000000000
    value  : -0,0000000874227766
    diff   : +0,0000000874227766
sin_adam42  failed the 'sin(-pi/2) == -1' test   
    expect : -1,0000000000000000
    value  : -0,9999999868726537
    diff   : +0,0000000131273463
sin_adam42  failed the 'sin(pi/2) == 1' test   
    expect : +1,0000000000000000
    value  : +0,9999999868726537
    diff   : +0,0000000131273463
sin_spiro_c11_s  failed the 'sin(-pi/2) == -1' test   
    expect : -1,0000000000000000
    value  : -0,9999998807907104
    diff   : +0,0000001192092896
sin_spiro_c11_s  failed the 'sin(pi/2) == 1' test   
    expect : +1,0000000000000000
    value  : +0,9999998807907104
    diff   : +0,0000001192092896
sin_spiro_c11  failed the 'sin(-pi) == 0' test   
    expect : +0,0000000000000000
    value  : +0,0000001698080958
    diff   : +0,0000001698080958
sin_spiro_c11  failed the 'sin(-pi/2) == -1' test   
    expect : -1,0000000000000000
    value  : -0,9999999175527783
    diff   : +0,0000000824472217
sin_spiro_c11  failed the 'sin(pi/2) == 1' test   
    expect : +1,0000000000000000
    value  : +0,9999999175527783
    diff   : +0,0000000824472217
sin_spiro_c11  failed the 'sin(pi) == 0' test   
    expect : +0,0000000000000000
    value  : -0,0000001698080958
    diff   : +0,0000001698080958

 




#5309328 Faster Sin and Cos

Posted by on 03 September 2016 - 02:01 PM

So I implemented a few sanity tests for fun, such as what's the value for sin(0) or sin(-PI), or whether functions are monotonically rising/falling within the respective intervals.

 

Maybe my tests are whacky, will have to look closer tomorrow morning... but... not a single implementation passes the tests :lol:

(including the C lib)




#5309318 Faster Sin and Cos

Posted by on 03 September 2016 - 11:50 AM

A couple of updates. First, on why double precision is faster than single...

vmulss %xmm0,%xmm0,%xmm1
vcvtss2sd %xmm0,%xmm0,%xmm0
vcvtss2sd %xmm1,%xmm1,%xmm1
vfmadd213sd 0x8b3b(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b3a(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b39(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b38(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b37(%rip),%xmm2,%xmm1
vmulsd %xmm0,%xmm1,%xmm0
vcvtsd2ss %xmm0,%xmm0,%xmm0  <-------- aha!

Yep, that's why. Both do the same, except the last line.

 

Second, having looked at the disassembly now, I'm shocked that GCC doesn't even inline a single of these functions! It actually does function calls! Why?
Well, I'm using a function that takes a functor like this:

template<typename F> auto test(F f, [blah blah])
{
    ...
    qpc.start();
        for(...)
            sum += f(t);
    qpc.stop();

    volatile double discard = sum;

    return qpc;
}

The assumption is that, of course, the compiler will inline that function pointer to a simple three-liner since it's being called on a darn constant expression. Which, of course, the compiler can see immediately, just like it can see immediately that the function is rather trivial and easily inlineable. Guess what, it doesn't. However, changing the function call to a lambda will work just fine. Tell me about being unlogical. Grrr...

 

On the positive side, this means that the custom functions are really faster because they're faster, not because the compiler inlines them and doesn't inline the library call.

Now, trying to get GCC to auto-vectorize this over an array of 10k doubles doesn't seem to work. Even if you "help" it and manually unroll the loop 4 times, it just generates 4 times the scalar code. Oh well, was worth trying.




#5309287 Faster Sin and Cos

Posted by on 03 September 2016 - 04:25 AM

But double precision runs faster than single precision version, anyway

With what compiler settings?
X86 compilers tend to produce retarded float code unless you use the fast math option to opt out of IEEE 32but strictness.
With any setting, funnily. Complete timings with and without --fast-math (also with SSE and 387 math, and combi-mode) are 7 posts above. Double is faster every time.
Must be that all calculations are double anyway (sure are for 387, but I would have assumed differently for SSE), and single precision forces an explicit conversion somewhere. Whatever it is, we're talking like 7.5% on an operation in the single-nanosecond range, so really no biggie.

Might be more visible on a single call. Maybe it's woth RDTSCing a single call to see how many cycles go into it. I assume there is some really hefty pipelining involved in that benchmark because if you break down the time measured by QPC and number of iterations, you get a number that's around 5-6 or so clock cycles for the fastest method. Which is obviously much faster than this chain of multiplications and additions can possibly be. Unless QPC is lying to me... but wall clock says something very similar. Or very deep pipelining.

But in a real application where it matters (because you do hundred thousands of calls in a batch), you will have that very same pipelining, so I guess the benchmark is not "wrong" as such. If you only have to calculate a handful of sines, who cares anyway.




PARTNERS