
fast sin approx algo not faster than sin() ?


Hello

 

I just tried this sine approximation; it's very rough, but accurate enough for my purposes:

 

// from https://web.archive.org/web/20100613230051/http://www.devmaster.net/forums/showthread.php?t=5784


// Only valid for x in [-pi, pi]
double fsin(double x) {
    return 1.27323954474 * x + -0.40528473456 * x * fabs(x);
}

 

I compared the speed to sin with a benchmark like this :

// SDL_GetPerformanceCounter() returns a Uint64, so don't truncate it to int
Uint64 time = SDL_GetPerformanceCounter();

double a = 0;
for (int i = 0; i < 100000000; i++) {
    a += fsin(a);
}
printf("fsin result : %f, time : %llu\n", a,
       (unsigned long long)(SDL_GetPerformanceCounter() - time));

time = SDL_GetPerformanceCounter();

a = 0;
for (int i = 0; i < 100000000; i++) {
    a += sin(a);
}
printf("sin result : %f, time : %llu\n", a,
       (unsigned long long)(SDL_GetPerformanceCounter() - time));
 

 

 

I'm quite surprised that both sin and fsin run at about the same speed on my computer (desktop i7). When I turn off all optimizations in my compiler (Visual Studio 2013), fsin becomes 5 times faster than sin. I thought that, even if sin were implemented in hardware in the CPU, it would still be slower than this very simple fsin. Or maybe my benchmark isn't done properly? I ran it several times and it gave the same results.

Edited by isbinil

While that's possible (very unlikely, but possible), before asking "why", one should be sure about the "if". There are at least two (well, four) things that can easily go wrong in such a benchmark.

One is starting with a CPU that is in a low power state. Surprise, the first tested algorithm (which runs at 30% CPU speed) is soooo much slower than the second (which runs at 100% CPU speed). Even when you run the same test twice, this happens. Nobody can tell why. Weird.
---> Solution: Do some work before the first test.

The second is the fact that timers have finite resolution. If you run a test that takes a second or two, no worries. A millisecond timer will do fine, and a 100 nanosecond timer will do fine, too. But if your test run only takes, say, 15 milliseconds, and your timer has a resolution of 15.6ms (like e.g. the Windows tick counter), you may measure anything from 0 to 15 to 31 ms. That's an exaggerated example, but you see the problem.
---> Use a sufficiently high-res timer, in comparison to the time you measure.

The third thing would be caches, but it's unlikely that you use them in this test. It's nevertheless something people fall for now and then. Run algo 1 over data, then run algo 2 over data, and surprise... algo 2 is 10 times faster. Nobody knows why!
---> Be sure either everybody gets to start with a warm cache, or nobody.

The fourth thing to look for (which I forgot to mention since I assumed it was obvious, but maybe it's not) is that you actually use each result. Assign to a volatile after the benchmark, or return the value from main, or whatever you do. If you do not use a value, compilers will sometimes (indeed usually) just optimize out the entire loop, even if you build without optimizations. Edited by samoth


Thanks, i'll do more tests again this evening.

 

I think it's not the CPU state, since I reversed the test order and it behaved the same. For the timer, I use the high-resolution timer available on the platform (Windows -- SDL_GetPerformanceCounter, which uses some CPU timer, I think), but maybe it's still imprecise; I'll try with more than 100 million iterations to be sure.

Edited by isbinil


Another thing to note is that Quake 3 came out in *1999*. CPU architecture has changed a lot since then; we've got all sorts of magical intrinsics that compilers now actually use.

 

It could be, yes, that the software sin implementation in the standard library takes more time to calculate because it cares about accuracy, while the *hardware* implementation the compiler actually uses is faster because, well, it's optimized hardware. Hard to beat that.


That's what I guess, yeah. But we still find "modern" (2012) posts that talk about faster sin functions ( http://www.gamedev.net/topic/621589-extremely-fast-sin-approximation/ )

 

Dunno if hardware trigonometry functions have only been implemented very recently?

There is a current thread on this: http://www.gamedev.net/topic/681723-faster-sin-and-cos/

 

Bottom line: a pretty good custom implementation (way better than yours, and about twice as much work) can be easily 10 times, if not 20 times faster than the standard library.

 

Hardware trig functions have been around since... forever (I think 386 had it already). But they have latencies in the 200 cycles range (and I think you can only start one every 200 or so cycles, too). Even traditional 387 math implementations are usually faster (and approximations anyway).

 

On modern hardware, a couple of fused-multiply-add instructions (which is what your code boils down to) have a latency of maybe a dozen or so cycles at most, and depending on how well the actual code allows pipelining to kick in, it might effectively cost you 1-2 cycles in the best case.

 

 

For the timer i use the the high resolution timer available on the platform

This is mighty fine for anything like at least 10 million or so iterations. QPC has a resolution of about 0.3 µs on my system (which is also an i7).

Thus, you can say that anything that takes something-milliseconds will be fine to measure with good precision, anything that is something-micro (like ten thousand to a million iterations) is kinda acceptable, but precision will not be stellar, and anything that is something-nano (such as only running e.g. 5-10 iterations) is a pure bullshit measure.

 

You can time single iterations (or a few) using the CPU's TSC counter, but there are three important gotchas. First, the TSC may not be in sync between different cores. That's said to be mostly a "historic problem" and no longer the case, but I just got "negative time" on my current system trying this last week. Second, what you measure is processor cycles, not time. That's actually an advantage because you are no longer subject to different clock rates. If your algorithm takes 15 CPU cycles, it will take 15 cycles at 800MHz and it will still take 15 cycles at 4,000MHz (only now the cycles are faster!). It's something to be aware of, however... some people use the TSC for measuring time, and that's bound to fail. Lastly, to get a precise measurement, you must completely empty the processor pipeline before reading the TSC. In other words, rather than just RDTSC, you must execute CPUID followed by RDTSC (CPUID being the only serializing instruction that you can call in a user process).
 

Edited by samoth


I would disassemble both functions.

 

You said they run at the same speed unless you hobble the compiler. My guess is that the compiler is smart enough to optimize it. I would think sine and cosine are common enough on computers that it would be important to optimize them and that Microsoft has spent some time doing so.

 

So, I would want to compare the machine code. I would imagine these are both only a handful of instructions. Intel used to publish manuals on the machine code on their website. I haven't looked for them in years. I assume they still do. You may be able to get the number of CPU cycles each instruction costs and explain on a machine code level why they are the speed that they are without even running the code.


You may be able to get the number of CPU cycles each instruction costs and explain on a machine code level why they are the speed that they are without even running the code.

 

This has been nearly impossible to do since at least the time of the Pentium II. The only way to tell how long a piece of code takes is to run it, and even then the exact way in which you run it (cache usage, predictability of branches,  whether the result can be computed at compile time, whether the function call can be inlined...) matters a lot.


You may be able to get the number of CPU cycles each instruction costs and explain on a machine code level why they are the speed that they are without even running the code.

This has been nearly impossible to do since at least the time of the Pentium II. The only way to tell how long a piece of code takes is to run it, and even then the exact way in which you run it (cache usage, predictability of branches, whether the result can be computed at compile time, whether the function call can be inlined...) matters a lot.
Well, there is IACA, which does just that. Unluckily, Intel wouldn't be Intel if their website actually worked and you were able to download it.

But in theory... awesome. I used it some years ago.


Tried to benchmark again, using a better procedure to keep the compiler from cheating, and the results seem more plausible:

1473180076-tests.png

(I don't know what the measurement unit is, but it doesn't matter for comparison purposes)

 

  • 5.3 times faster than sin(), but quite imprecise: fsin:
double fsin(double x) {
    return 1.27323954474 * x + -0.40528473456 * x * fabs(x);
}
  • 1.8 times faster than sin(): lspiro's sine function (converted to pure C with doubles to compare with the other functions)
// note: the float 'f' suffixes were dropped so the constants keep double precision
double lspirosin(double _fX) {
    int i32I = _fX * 0.31830988618379067153776752674503; // 1 / PI
    _fX = _fX - i32I * 3.1415926535897932384626433832795;

    double fX2 = _fX * _fX;

    return (i32I & 1) ?
        -_fX * (1.0 +
            fX2 * (-1.66666671633720398e-01 +
            fX2 * (8.33333376795053482e-03 +
            fX2 * (-1.98412497411482036e-04 +
            fX2 * (2.75565571428160183e-06 +
            fX2 * (-2.50368437093584362e-08 +
            fX2 * (1.58846852338356825e-10 +
            fX2 * -6.57978446033657960e-13))))))) :
        _fX * (1.0 +
            fX2 * (-1.66666671633720398e-01 +
            fX2 * (8.33333376795053482e-03 +
            fX2 * (-1.98412497411482036e-04 +
            fX2 * (2.75565571428160183e-06 +
            fX2 * (-2.50368437093584362e-08 +
            fX2 * (1.58846852338356825e-10 +
            fX2 * -6.57978446033657960e-13)))))));
}

My test procedure : 

printf("waking up cpu\n");
volatile double a = 0.1;

for (int i = 0; i < 10000; i++) {
    for (double j = -8; j < 8; j += 0.0001) {
        a += 0.1;
    }
}
printf("test start!\n");

// Uint64 to match SDL_GetPerformanceCounter()'s return type
Uint64 time = SDL_GetPerformanceCounter() / 100;
for (int i = 0; i < 100; i++) {
    for (double j = -8; j < 8; j += 0.00001) {
        a += fsin(j);
    }
}
printf("fsin : %llu\n",
       (unsigned long long)(SDL_GetPerformanceCounter() / 100 - time));

time = SDL_GetPerformanceCounter() / 100;
for (int i = 0; i < 100; i++) {
    for (double j = -8; j < 8; j += 0.00001) {
        a += sin(j);
    }
}
printf("standard sin : %llu\n",
       (unsigned long long)(SDL_GetPerformanceCounter() / 100 - time));

time = SDL_GetPerformanceCounter() / 100;
for (int i = 0; i < 100; i++) {
    for (double j = -8; j < 8; j += 0.00001) {
        a += lspirosin(j);
    }
}
printf("lspiro sin : %llu\n",
       (unsigned long long)(SDL_GetPerformanceCounter() / 100 - time));

printf("%f\n", a);

I thought there would be larger speed differences between those; maybe my benchmark is still flawed.

 

Compiled with MSVC2013, with all optimizations turned on.

Edited by isbinil
