Recently I wrote some SIMD/AVX assembly language functions that compute:
- two sines (in xmm registers)
- two cosines (in xmm registers)
- four sines (in ymm registers)
- four cosines (in ymm registers)
- four sines or cosines (in ymm registers) --- an arbitrary mix of sines and cosines
The last one takes an integer argument that specifies whether you want the sine or cosine for each argument. The input angles and results are all f64 values (64-bit "double precision" floating point).
Writing those routines was quite the learning experience, in more ways than one!!!
The main purpose was to get very good at the newest AVX/XOP/FMA* and related instructions. I haven't done a lot of completely parallel routines with SIMD instructions, and these were definitely completely parallel, since they compute the sines and cosines of up to four arguments exactly simultaneously. Believe it or not, the entire routine to compute a random mix of 4 sines and cosines has zero jmps (conditional or unconditional). The code just flows all the way through from start to finish! It was quite satisfying to find that I could write the code that way, and major congrats to the guys at AMD [and Intel] for having just the right instructions to support this kind of flow-through code.
The routines are fast, especially when you consider you're getting 4 results for the price of one. I was able to organize the instructions so the code pretty much never needed to access the result of an instruction until at least 4 instructions later, which seems to be the latency of 256-bit wide ymm instructions.
However, there is one downside. Since the results are computed in the SIMD registers, not the 80-bit wide FPU registers, there are certain angles where the results are not quite as precise as the FPU sine or sine-cosine routines. The results are always correct to 16-digits from the decimal point, but at certain angles (pi/2 if I recall correctly), where the result is almost zero, some significant bits are lost (even though the result is still correct to 16-digits from the decimal point). I was able to definitively pin down the problem to a place in the routine (and all similar cheby or taylor routines) where a subtraction of similar-size numbers occurs.
I organized my algorithm to minimize the error, but it was impossible to eliminate. Of course, most people don't give a damn about this precision loss, since they still have 16-digits of precision from the decimal point. But I'm one of those people who sweats the very last bits in some of the computations I do, for example astrodynamics (solar-system orbit computations). I made absolutely sure I knew where the problem was, by writing a version of the routines that performed the precision-losing subtracts on the mantissas alone in 64-bit integers... and that solved the "problem".
I actually have 5 variations with 7, 8, 9, 10, 11 chebyshev coefficients. The longer ones are supposed to be more precise, but it appear like 7 coefficiencts is sufficient (and is obviously a bit faster).
Anyway, if you're interested in the nitty gritty, I'll post one or two of these routines here. Or if you prefer, PM or email them to you.