fast sin approx algo not faster than sin() ?

8 comments, last by apatriarca 7 years, 7 months ago
Unless every single CPU cycle counts, I recommend you use Spiro's version (text-replace float with double). If that's too slow, use Garrett's.

The reason is that while your version is about 0.5 nanoseconds faster than Garrett's (and nearly twice as fast as Spiro-C11 in double) on average, over 500 million samples (which, frankly, means nothing at all!), it is abysmal when it comes to max error, average error, and root mean squared error. It also fails every single one of the sanity tests on every value. In other words, it loses on every metric.

That may not matter at all; it can still be a perfectly usable approximation for your case. But there may be cases (which you cannot even foresee now) where it does matter. And using something that is vastly better doesn't really cost you anything: same timing, more or less, and the code is readily available, just copy and paste.

If I remove the "volatile" keyword for the variable used in the test, standard sin() becomes uber-fast with a score of approx 5600. I guess it ignores 99 of the 100 loop iterations because the values are already in the CPU cache.

The volatile keyword basically prevents any optimization in the loop. Moreover, each loop iteration depends on the previous one, which means poor pipeline utilization on the CPU and no chance for SIMD auto-vectorization. This isn't a particularly meaningful benchmark, and I don't think it is representative of the real performance of these functions.

I would simply use sin or sinf. In most cases the compiler will emit the correct CPU instruction, possibly also using SIMD ones or evaluating the calls as constant expressions.

The volatile keyword basically prevents any optimization in the loop.

That's right, but volatile is not supposed to be in the loop.

It is, however, very important. It acts as a sink. The sin function, or any of its approximations, is a pure function: its result depends only on its inputs, and its execution does not influence any state other than its return value. In other words, calling or not calling the function is the same; nobody but the consumer of the return value can tell the difference, and there is no externally observable effect.

The compiler will thus, according to the as-if rule, translate this:


int main()
{
    timer.start();
    for(1,000,000 times)
        expensive_pure_function();
    print(timer.stop());
}

to just that:


int main()
{
    timer.start();
    print(timer.stop());
}

The use of volatile to consume each result prevents it from performing that optimization:


int main()
{
    double sum{};

    timer.start();
    for(1,000,000 times)
        sum += expensive_pure_function();   // no volatile here
    print(timer.stop());

    volatile double discard = sum;   // sink
}

Alternatively, you could just print the result (but printing a meaningless number is ugly and confusing), cast the accumulated result to int and return it from main, or omit initializing the accumulating variable. Returning the value from main is somewhat clumsy due to the cast and will give a narrowing or "loses precision" warning, and the uninitialized-variable trick uses an indeterminate value. Although in practice that most of the time "works" (the compiler cannot determine what the result will be, so it has no choice but to execute everything), it is technically undefined behavior. The thing with undefined behavior is, you just never know for sure. The compiler might just as well show you the middle finger and optimize out the entire main function because of the UB.

With volatile, you are telling the compiler, in an easy and well-defined way, that it is required to set the value of discard at the end (even though it's not otherwise used). It is not allowed to reorder that store, or to omit it altogether.

In order to do that, the compiler must necessarily know the value and therefore cannot optimize out the calculation. It could, admittedly, run the complete set of calculations at compile time, and for iteration counts below 500 or so, this indeed happens. But for iteration counts in the millions, no chance.

My problem with volatile was its use inside the loop. In that case the compiler can't rely on the variable not having changed between iterations, and it has to load and store it each time it is used. An alternative solution, I think, is to write to a big array.

You may be able to get the number of CPU cycles each instruction costs and explain on a machine code level why they are the speed that they are without even running the code.

 
This has been nearly impossible to do since at least the time of the Pentium II. The only way to tell how long a piece of code takes is to run it, and even then the exact way in which you run it (cache usage, predictability of branches, whether the result can be computed at compile time, whether the function call can be inlined...) matters a lot.

Well, there is IACA, which does just that. Unluckily, Intel wouldn't be Intel if their website actually worked and you were able to download it.
Gah... I figured it out. Intel... Khaaaaan!

Looking at where the redirection loop gets you (if you paste the link into wget, it spits out the redirections), you land on a site that requires you be logged in with an Intel developer account.

Which would be great if only someone told you! There is a page "accept EULA, download here" and when you click on the link, it just keeps redirecting your browser in an endless loop until the browser gives up.

So... the solution is simple: register an account (use a mailinator address or whatever), log in prior to going to the IACA download page, and it works just fine. :)
Time to test:
#include "iacaMarks.h"
#include <math.h>

double fsin(double x) {
    return 1.27323954474 * x - 0.40528473456 * x * fabs(x);
}

int main(int argc, char** argv)
{
    double d = 0.0;

    IACA_START
    for(int i = 1; i < 1000; ++i)
        d += fsin(argc);
    IACA_END

    return (int)d;
}
(drum roll)
X:\iaca\iaca-win64>gcc -O3 -march=skylake test.cpp -o test.exe && iaca -64 test.exe
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - test.exe
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 3.25 Cycles       Throughput Bottleneck: FrontEnd

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 2.5    0.0  | 2.5  | 1.5    1.5  | 1.5    1.5  | 0.0  | 2.0  | 2.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   0*   |           |     |           |           |     |     |     |     |    | vxorpd xmm1, xmm1, xmm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | mov eax, 0x3e7
|   2    |           | 1.0 |           |           |     | 1.0 |     |     |    | vcvtsi2sd xmm1, xmm1, ebx
|   2^   | 1.0       |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmulsd xmm0, xmm1, qword ptr [rip+0x1305]
|   0*   |           |     |           |           |     |     |     |     |    | vmovapd xmm2, xmm1
|   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 1.0 |     |     |    | vandpd xmm2, xmm2, xmmword ptr [rip+0x1309]
|   1    | 0.9       | 0.1 |           |           |     |     |     |     |    | vmulsd xmm0, xmm0, xmm2
|   2^   | 0.6       | 0.4 | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vfmadd132sd xmm1, xmm0, qword ptr [rip+0x130c]
|   0*   |           |     |           |           |     |     |     |     |    | vxorpd xmm0, xmm0, xmm0
|   0*   |           |     |           |           |     |     |     |     |    | nop dword ptr [rax+rax*1], eax
|   1    |           | 1.0 |           |           |     |     |     |     |    | vaddsd xmm0, xmm0, xmm1
|   1    |           |     |           |           |     |     | 1.0 |     |    | sub eax, 0x1
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffff9
Total Num Of Uops: 12
Well, coming as 100% free, and being so darn easy to use, I'd say this is a really awesome tool. Only Skylake does not seem to be supported yet; Haswell is the latest arch this version supports.

This topic is closed to new replies.
