Faster Sin and Cos

37 comments, last by alvaro 7 years, 7 months ago

What are people doing (in particular in games) that makes the calls to sin and cos take a noticeable chunk of time?

Calculating object rotations/orientations.


Seeing how one billion invocations take slightly under 0.4 seconds on my machine, this seems to be a very serious problem! If you have only 1ms of time budget for all your sin/cos calculations, and you can't do parallel work, you can draw no more than 2.5 million objects!
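For reference, the arithmetic behind those numbers: 10^9 calls in 0.4 s comes out to roughly 0.4 ns per call, so a 1 ms budget covers about 1 ms / 0.4 ns = 2.5 million calls (assuming one sin/cos call per object).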

But of course, doing some pointless optimization is never bad; I'm all for that. Faster is always better :)

Joke aside, this here:

It would be nice if the results were never larger than one

indeed seems to be less pronounced in Garrett's implementation. Although I only see it in Spiro's single-precision version, the double-precision version seems to be immune to it. On one billion samples between -Pi and Pi, Spiro single-precision has 380 hits, Adam_42 single-precision even has 2030 (why???), and all others have zero hits.

So, as long as you use the double version, you should be good: no values greater than one. (You could probably solve the polynomial analytically to see exactly which inputs, if any, produce outputs greater than one, but I'm too lazy for that and too many years out of school... leaving that as homework for someone else, hehehe... a billion samples and zero hits is good enough for me!)
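In case anyone wants to reproduce the hit counts, the scan is essentially this (a sketch; sin_approx is a placeholder for whichever implementation from this thread you drop in):

#include <cmath>
#include <cstdint>
#include <cstdio>

// Placeholder: swap in the approximation under test (Spiro's, Garrett's, Adam_42's, ...).
static float sin_approx(float x) { return std::sin(x); }

int main() {
    const std::uint64_t samples = 1000000000ull;   // one billion evenly spaced inputs
    const double pi = 3.14159265358979323846;
    std::uint64_t hits = 0;
    for (std::uint64_t i = 0; i < samples; ++i) {
        // Inputs spread across [-Pi, Pi].
        float x = float(-pi + 2.0 * pi * (double(i) / double(samples - 1)));
        if (std::fabs(sin_approx(x)) > 1.0f)
            ++hits;
    }
    std::printf("results with magnitude > 1: %llu\n", (unsigned long long)hits);
    return 0;
}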

But the double precision version runs faster than the single precision version, anyway

With what compiler settings?
x86 compilers tend to produce poor float code unless you use the fast-math option to opt out of strict IEEE 32-bit semantics.

It would be nice if the results were never larger than one (e.g., Sin(1.57083237171173f) gives me 1.00000011920929f). Can the coefficient optimization be constrained by that?

All implementations of sin()/cos() on digital machinery are approximations, and in practice a result slightly above 1.0 does not cause problems, particularly in real-time situations.
The simple act of constructing a "unit" vector from the standard sinf() and cosf() (or even the double-precision sin() and cos()) functions creates vectors with a length greater than 1.0 approximately half of the time.
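That is easy to verify with a quick count; a minimal sketch (the sample count and angle spacing are arbitrary):

#include <cmath>
#include <cstdio>

int main() {
    const int samples = 1000000;
    int over = 0;
    for (int i = 0; i < samples; ++i) {
        float a = 6.2831853f * (float(i) / float(samples));   // angles in [0, 2*Pi)
        float x = std::cos(a), y = std::sin(a);
        // Mathematically the squared length is exactly 1; in float it rounds
        // above 1 roughly half of the time.
        if (x * x + y * y > 1.0f)
            ++over;
    }
    std::printf("%d of %d vectors have squared length > 1\n", over, samples);
    return 0;
}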


Although I could add a further constraint to my evaluations so that I determine the most accurate constants that never produce a value above 1.0, the whole point of these approximations is to relax strictness in order to gain speed.

Eventually I will get to the 21st-degree polynomials, and I will add that extra constraint then, because by that point you are more concerned with accuracy than performance; 21st-degree polynomials tend to perform worse than his paper suggests on a range of hardware, including the current consoles.

Meanwhile, although my float values have been evaluated in double-precision algorithms, I have determined double constants that provide an even better approximation for a 15th-degree polynomial.
I will post those soon.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

What are people doing (in particular in games) that makes the calls to sin and cos take a noticeable chunk of time?


Calculating object rotations/orientations.


Do people really do that? 3D rotations are easily encoded using unit-length quaternions, and 2D rotations using unit-length complex numbers. If you use those, you basically don't need any trigonometric functions for rotations/orientations, just multiply-adds. I thought that's what everyone would be doing.
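For the 2D case, a minimal sketch of what that looks like; the rotation is stored as (c, s) = (cos(angle), sin(angle)), built once (or composed from other rotations), and everything else is multiply-adds:

#include <cstdio>

// A 2D rotation stored as a unit-length complex number c + s*i.
struct Rot2 { float c, s; };

// Composing two rotations is complex multiplication -- no trig involved.
Rot2 compose(Rot2 a, Rot2 b) {
    return { a.c * b.c - a.s * b.s, a.s * b.c + a.c * b.s };
}

// Rotating a point is likewise just multiply-adds.
void rotate(Rot2 r, float& x, float& y) {
    float nx = r.c * x - r.s * y;
    float ny = r.s * x + r.c * y;
    x = nx; y = ny;
}

int main() {
    Rot2 quarter = { 0.0f, 1.0f };          // 90 degrees, written down directly
    Rot2 half = compose(quarter, quarter);  // 180 degrees, still no sin/cos
    float x = 1.0f, y = 0.0f;
    rotate(half, x, y);
    std::printf("(%f, %f)\n", x, y);        // roughly (-1, 0)
    return 0;
}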

Someone put this blanket in the dryer.

void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.

But the double precision version runs faster than the single precision version, anyway

With what compiler settings?
x86 compilers tend to produce poor float code unless you use the fast-math option to opt out of strict IEEE 32-bit semantics.
With any setting, funnily. Complete timings with and without -ffast-math (also with SSE and 387 math, and the combined mode) are 7 posts above. Double is faster every time.
Must be that all calculations are done in double anyway (they certainly are for 387, but I would have assumed differently for SSE), and single precision forces an explicit conversion somewhere. Whatever it is, we're talking about 7.5% on an operation in the single-nanosecond range, so really no biggie.

Might be more visible on a single call. Maybe it's worth RDTSCing a single call to see how many cycles go into it. I assume there is some really hefty pipelining involved in that benchmark, because if you divide the time measured by QPC by the number of iterations, you get something around 5-6 clock cycles for the fastest method. Which is obviously much faster than this chain of multiplications and additions can possibly execute end-to-end. Unless QPC is lying to me... but the wall clock says something very similar. So: very deep pipelining.
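Something like this is what I have in mind for timing a single call (a sketch; __rdtsc() is the x86 intrinsic on GCC/Clang, and the usual caveats about serialization, call overhead and frequency scaling apply):

#include <x86intrin.h>   // __rdtsc()
#include <cmath>
#include <cstdio>

int main() {
    volatile float input = 1.0f;   // volatile so the call below isn't folded away
    float x = input;

    unsigned long long start = __rdtsc();
    float r = std::sin(x);         // or one of the approximations from this thread
    unsigned long long stop = __rdtsc();

    std::printf("result %f, roughly %llu cycles (single sample, very noisy)\n",
                r, (unsigned long long)(stop - start));
    return 0;
}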

But in a real application where it matters (because you make hundreds of thousands of calls in a batch), you will have that very same pipelining, so I guess the benchmark is not "wrong" as such. If you only have to calculate a handful of sines, who cares anyway.
What's the critical path? (I'm far too lazy to work it out right now.) There's a lot of parallelizable multiplication there. If it's flooding the available multipliers, then it's likely that near-future hardware (with more mul units) will give a greater improvement.

...

How many multipliers do modern cores have now anyway? Or are FPUs monolithic?
void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.

One question: why put all those muls/adds in that last branch if they seem to be the same for both paths (except for the "- x *" and "x *" parts)? Or am I missing something here?

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and Making a Terrain Generator

A couple of updates. First, on why double precision is faster than single...


vmulss %xmm0,%xmm0,%xmm1
vcvtss2sd %xmm0,%xmm0,%xmm0
vcvtss2sd %xmm1,%xmm1,%xmm1
vfmadd213sd 0x8b3b(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b3a(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b39(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b38(%rip),%xmm1,%xmm2
vfmadd213sd 0x8b37(%rip),%xmm2,%xmm1
vmulsd %xmm0,%xmm1,%xmm0
vcvtsd2ss %xmm0,%xmm0,%xmm0  <-------- aha!

Yep, that's why. Both versions do the same work, except for that final conversion back to float.
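That codegen is exactly what you'd expect from a float wrapper around a polynomial written with double constants, shaped roughly like this (a sketch with placeholder Taylor-style coefficients, not the tuned constants from this thread):

// Because the coefficients are double literals, the whole polynomial is
// evaluated in double and only converted back at the end -- hence the
// vcvtss2sd conversions at the top and the vcvtsd2ss at the bottom of the
// listing above.
float sin_poly(float xf) {
    double x  = xf;                 // vcvtss2sd
    double x2 = x * x;
    double p  = -2.5052e-8;         // placeholder coefficient (x^11 term)
    p = p * x2 + 2.7557e-6;         // each step can become one vfmadd213sd
    p = p * x2 - 1.9841e-4;
    p = p * x2 + 8.3333e-3;
    p = p * x2 - 1.6667e-1;
    p = p * x2 + 1.0;
    return float(p * x);            // vmulsd, then vcvtsd2ss back to float
}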

Second, having looked at the disassembly now, I'm shocked that GCC doesn't inline a single one of these functions! It actually does function calls! Why?
Well, I'm using a test function that takes a functor, like this:


template<typename F> auto test(F f, [blah blah])
{
    ...
    qpc.start();
        for(...)
            sum += f(t);
    qpc.stop();

    volatile double discard = sum;

    return qpc;
}

The assumption was, of course, that the compiler would inline that function pointer into a simple three-liner, since it's being called on a darn constant expression. Which, of course, the compiler can see immediately, just like it can see immediately that the function is trivial and easily inlineable. Guess what: it doesn't. However, changing the call to take a lambda works just fine. Talk about being illogical. Grrr...
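The difference boils down to something like this (a self-contained sketch; run() stands in for the test() harness above and my_sin for one of the approximations):

#include <cmath>
#include <cstdio>

// Stand-in for one of the polynomial approximations discussed in this thread.
static double my_sin(double x) { return std::sin(x); }

// Minimal stand-in for the test() harness: call f many times, keep the sum alive.
template<typename F>
double run(F f) {
    double sum = 0.0;
    for (int i = 0; i < 1000000; ++i)
        sum += f(i * 1e-6);
    volatile double discard = sum;   // prevent the loop from being optimized away
    return discard;
}

int main() {
    // Passed as a function pointer: F is double(*)(double), and the call inside
    // the loop stays an indirect call unless the compiler propagates the pointer.
    double a = run(my_sin);

    // Wrapped in a lambda: the closure has its own unique type, its operator()
    // calls my_sin directly, and the whole thing inlines without trouble.
    double b = run([](double x) { return my_sin(x); });

    std::printf("%f %f\n", a, b);
    return 0;
}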

On the positive side, this means that the custom functions are really faster because they're faster, not because the compiler inlines them and doesn't inline the library call.

Now, trying to get GCC to auto-vectorize this over an array of 10k doubles doesn't seem to work. Even if you "help" it and manually unroll the loop four times, it just generates four copies of the scalar code. Oh well, it was worth trying.
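If it ever matters, doing it by hand with intrinsics is not much work; a sketch (assumes AVX plus FMA, uses placeholder Taylor-style coefficients rather than the tuned constants, and for brevity expects the element count to be a multiple of 4):

#include <immintrin.h>

// Evaluate a sin-style odd polynomial over an array, four doubles at a time.
void sin_poly_avx(const double* in, double* out, int count) {
    const __m256d c9  = _mm256_set1_pd( 1.0 / 362880.0);
    const __m256d c7  = _mm256_set1_pd(-1.0 / 5040.0);
    const __m256d c5  = _mm256_set1_pd( 1.0 / 120.0);
    const __m256d c3  = _mm256_set1_pd(-1.0 / 6.0);
    const __m256d one = _mm256_set1_pd(1.0);

    for (int i = 0; i < count; i += 4) {
        __m256d x  = _mm256_loadu_pd(in + i);
        __m256d x2 = _mm256_mul_pd(x, x);

        // Horner's scheme: one fused multiply-add per coefficient.
        __m256d p = c9;
        p = _mm256_fmadd_pd(p, x2, c7);
        p = _mm256_fmadd_pd(p, x2, c5);
        p = _mm256_fmadd_pd(p, x2, c3);
        p = _mm256_fmadd_pd(p, x2, one);

        _mm256_storeu_pd(out + i, _mm256_mul_pd(p, x));
    }
}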

This topic is closed to new replies.
