There is no possible way that code speeds up the calculation of sin and cos values for vectors, and it introduces a reentrancy problem (it's not thread safe).

When optimizing this sort of code there are three things you must do to achieve state-of-the-art performance.

1) Ensure you are using the greatest known mathematical reduction of the algorithm

2) Eliminate all branches (even if it means more calculations)

3) Use vector (SIMD) operations, e.g. SSE, NEON, AVX, et al.
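As a rough illustration of point 2, here is a sketch of folding an angle from [-pi, pi] into [-pi/2, pi/2] (where sin(x) = sin(pi - x) lets you reuse one polynomial) without an if statement. The function name `fold_to_half_pi` and the constants are my own, purely for illustration. It always computes the reflected value and then selects arithmetically; whether that actually beats a branch depends on your compiler and target, so measure it.

```c
/* Hypothetical helper: fold x in [-pi, pi] into [-pi/2, pi/2] so that
 * sin(result) == sin(x). No if/else: both candidates are computed and
 * one is selected with a 0.0/1.0 multiplier. */
static inline float fold_to_half_pi(float x)
{
    const float pi      = 3.14159265f;
    const float half_pi = 1.57079633f;

    /* +1.0f for x >= 0, -1.0f otherwise; comparisons yield 0 or 1. */
    float sign = (float)(x >= 0.0f) - (float)(x < 0.0f);

    /* Reflection about +/- pi/2: sin(pi - x) == sin(x). */
    float reflected = sign * pi - x;

    /* 1.0f when |x| > pi/2, else 0.0f (sign * x is |x|). */
    float outside = (float)(sign * x > half_pi);

    /* Arithmetic select: x when inside, reflected when outside. */
    return x + outside * (reflected - x);
}
```

For example, `fold_to_half_pi(3.0f)` comes out near 0.1416 (that is, pi - 3), while `fold_to_half_pi(0.5f)` is returned unchanged. The extra multiply and subtract are the "more calculations" you pay to avoid the branch.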

Optimizing the code for vector operations can be very annoying.

Vectorized algorithms tend to favor a separate array for each element/dimension, as opposed to interleaved arrays, which are more convenient to deal with.

This cuts down on the time spent loading and packing the SIMD registers, and that can be critical to utilizing all available computation units.
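To make the layout point concrete, here is a small sketch that rotates 2D points by a precomputed sin/cos pair in both layouts. The names (`Vec2`, `rotate_aos`, `rotate_soa`) are made up for this example. With separate arrays, consecutive loads fill a register with all-x or all-y values directly; with the interleaved layout the compiler (or you) typically has to shuffle x and y apart first.

```c
#include <stddef.h>

/* Interleaved layout (x0 y0 x1 y1 ...): convenient, but x and y must be
 * de-interleaved before they can be processed in vector lanes. */
typedef struct { float x, y; } Vec2;

void rotate_aos(size_t n, const Vec2 *in, float c, float s, Vec2 *out)
{
    for (size_t i = 0; i < n; ++i) {
        out[i].x = c * in[i].x - s * in[i].y;
        out[i].y = s * in[i].x + c * in[i].y;
    }
}

/* Separate arrays per dimension: each array is contiguous, so a vector
 * load picks up several x (or y) values with no packing step.
 * restrict promises the arrays do not overlap. */
void rotate_soa(size_t n,
                const float *restrict x, const float *restrict y,
                float c, float s,
                float *restrict out_x, float *restrict out_y)
{
    for (size_t i = 0; i < n; ++i) {
        out_x[i] = c * x[i] - s * y[i];
        out_y[i] = s * x[i] + c * y[i];
    }
}
```

Both functions compute the same result; the difference is only in how easily the loop maps onto vector loads and stores. Check the generated assembly for your target before assuming which one wins.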

Doing the above and eliminating any IEEE-754 or C-standard overhead (e.g. if the rounding rules of the unit are different from the standard's, it has to perform a conversion when storing) is how you make it fast.

The old fsincos instruction got it done in about 137 clock cycles; SSE2 and newer should have faster or more vectorized options.

If you can sacrifice accuracy, you can use an approximation of the sin and cos values. Those algorithms are generally just multiplies and accumulates, and you can get it done in a lot fewer than 100 clock cycles.
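For a feel of what "just multiplies and accumulates" means, here is a minimal sketch using truncated Taylor polynomials evaluated in Horner form. The function name `sincos_approx` is hypothetical, and the input is assumed to already be reduced to [-pi/2, pi/2]. Taylor coefficients are used only because they are easy to verify; a tuned implementation would more likely use minimax coefficients for better accuracy at the same cost. With these terms the worst-case error at the ends of the range is on the order of 1e-3 for cos and smaller for sin.

```c
/* Hypothetical sketch: approximate sin(x) and cos(x) for x already
 * reduced to [-pi/2, pi/2]. Horner form keeps the work to a short
 * chain of multiply-adds with no branches or table lookups. */
static inline void sincos_approx(float x, float *s, float *c)
{
    float x2 = x * x;

    /* sin(x) ~= x - x^3/6 + x^5/120 - x^7/5040 */
    *s = x * (1.0f + x2 * (-1.0f / 6.0f
                   + x2 * ( 1.0f / 120.0f
                   + x2 * (-1.0f / 5040.0f))));

    /* cos(x) ~= 1 - x^2/2 + x^4/24 - x^6/720 */
    *c = 1.0f + x2 * (-1.0f / 2.0f
              + x2 * ( 1.0f / 24.0f
              + x2 * (-1.0f / 720.0f)));
}
```

Because every input takes the same path through the function, the loop around it is a good candidate for vectorization, which is where the real savings over a scalar instruction come from.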