What are people doing (in particular in games) that makes the calls to sin and cos take a noticeable chunk of time?
Calculating object rotations/orientations.
Seeing how one billion invocations take slightly under 0.4 seconds on my machine, this seems to be a very serious problem! If you have only 1ms of time budget for all your sin/cos calculations, and you can't do parallel work, you can draw no more than 2.5 million objects!
But of course, doing some pointless optimizations is never bad, I'm all for that. Faster is always better :)
Joke aside, this here:
It would be nice if the results were never larger than one
indeed seems to be less pronounced in Garrett's implementation, though I only see it at all in Spiro's single-precision version; the double-precision version seems to be immune to it. On one billion samples between -Pi and Pi, Spiro's single-precision has 380 hits, Adam_42's single-precision even has 2030 (why???), and all others have zero hits.
So, as long as you use the double version, you should be good: no values greater than one. (You could probably solve that equation to see exactly which inputs, if any, produce outputs greater than one, but I'm too lazy for that and too many years outta school... leaving that as homework for someone else, hehehe. A billion samples and zero hits is good enough for me!)
But double precision runs faster than the single precision version, anyway
With what compiler settings?
It would be nice if the results were never larger than one (e.g., Sin(1.57083237171173f) gives me 1.00000011920929f). Can the coefficient optimization be constrained by that?

All implementations of sin()/cos() on digital machinery are approximations, and there does not necessarily exist a case in which being slightly above 1.0 causes a problem, particularly in real-time situations.
With any setting, funnily. Complete timings with and without -ffast-math (also with SSE and 387 math, and combi-mode) are 7 posts above. Double is faster every time.
x86 compilers tend to produce poor float code unless you use the fast-math option to opt out of IEEE 32-bit strictness.
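For reference, those options are spelled roughly like this with GCC (the flag names are standard GCC options; the exact invocations used for the timings above aren't shown in the thread):

```shell
g++ -O2 -mfpmath=387 bench.cpp -o bench_387         # x87 math
g++ -O2 -mfpmath=sse -msse2 bench.cpp -o bench_sse  # scalar SSE math
g++ -O2 -ffast-math bench.cpp -o bench_fast         # opt out of strict IEEE
```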
One question: why put all those muls/adds in that last branch if they seem to be the same for both paths (except for the "- x *" and "x *" parts)? Or am I missing something here?
A couple of updates. First, on why double precision is faster than single...
vmulss %xmm0,%xmm0,%xmm1
vcvtss2sd %xmm0,%xmm0,%xmm0
vcvtss2sd %xmm1,%xmm1,%xmm1
vfmadd213sd 0x8b3b(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b3a(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b39(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b38(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b37(%rip),%xmm2,%xmm1
vmulsd %xmm0,%xmm1,%xmm0
vcvtsd2ss %xmm0,%xmm0,%xmm0 <-------- aha!
Yep, that's why. Both do the same, except for the conversions and the last line.
Second, having looked at the disassembly now, I'm shocked that GCC doesn't inline a single one of these functions! It actually does function calls! Why?
Well, I'm using a function that takes a functor like this:
template<typename F> auto test(F f, [blah blah])
{
    ...
    qpc.start();
    for(...)
        sum += f(t);
    qpc.stop();
    volatile double discard = sum;  // keep the sum alive so the loop isn't optimized away
    return qpc;
}
The assumption is that, of course, the compiler will inline that function pointer into a simple three-liner, since it's being called with a darn constant expression. Which, of course, the compiler can see immediately, just like it can see that the function is trivial and easily inlinable. Guess what: it doesn't. Changing the function pointer to a lambda, however, works just fine. Talk about being illogical. Grrr...
On the positive side, this means that the custom functions are really faster because they're faster, not because the compiler inlines them and doesn't inline the library call.
Now, trying to get GCC to auto-vectorize this over an array of 10k doubles doesn't seem to work. Even if you "help" it and manually unroll the loop 4 times, it just generates 4 times the scalar code. Oh well, it was worth trying.