SIMD sine?
Recently I was SIMD-izing some source code and ran into a matrix rotation that uses the sine/cosine functions. On the FPU those are easy, just use fsincos, but what would be the fastest/most effective way to get the sine and cosine using the SIMD instructions? I implemented a Taylor polynomial for a trial run, and I noticed that with my code (see below) there is a little bit of inaccuracy (especially when using singles).
Has anyone here ever done this before? I'm not even sure this is an "optimum" solution considering the inaccuracies.
I estimate: ~42 clocks to sine 4 packed singles or 2 packed doubles... But with those inaccuracies, it might not be worth it.
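To put a number on the inaccuracy, here is a minimal sketch in C of the Lagrange remainder bound for a sine Taylor polynomial truncated after the x^7 term: the truncation error is at most |x|^9 / 9!. The function name `taylor7_error_bound` is made up for illustration, and the bound covers only the truncation error, not floating-point rounding.

```c
/* Upper bound on |sin(x) - (x - x^3/6 + x^5/120 - x^7/5040)|.
   Lagrange remainder: |x|^9 / 9!, valid for every real x.
   Hypothetical helper name, used only for this illustration. */
double taylor7_error_bound(double x)
{
    double ax = x < 0.0 ? -x : x;   /* |x| without needing libm */
    double x2 = ax * ax;
    double x4 = x2 * x2;
    double x9 = x4 * x4 * ax;       /* |x|^9 */
    return x9 / 362880.0;           /* 9! = 362880 */
}
```

At x = 0.5 the bound is about 5e-9, which is below single-precision machine epsilon (about 1.2e-7), so any error seen there with singles is mostly rounding. At x = pi/2 the bound is about 1.6e-4, and near 2pi it grows to about 42, so the polynomial is only usable on a small interval around zero.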
BTW, this is testing code, so everything uses scalar singles and no pipelining. Nor do I check the bounds of x to be sure it's between 0 and 2pi. Also, it's in SpAsm syntax, not Intel syntax.
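The missing bounds check matters for a short Taylor polynomial, because the approximation is only good close to zero. One common approach is to fold the argument into [-pi/2, pi/2] first, using sin(r + k*pi) = (-1)^k * sin(r). A minimal scalar sketch in C, assuming x is finite and small enough that x/pi fits in an int; `taylor_sin7` and `reduced_sin7` are hypothetical names, not part of the code in this post.

```c
/* Taylor polynomial for sin(x) up to the x^7 term, written in Horner form:
   x * (1 - x^2/6 + x^4/120 - x^6/5040). */
float taylor_sin7(float x)
{
    float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f / 6.0f
                     + x2 * ( 1.0f / 120.0f
                     + x2 * (-1.0f / 5040.0f))));
}

/* Hypothetical helper: reduce x to r in [-pi/2, pi/2], then evaluate.
   Assumes x is finite and x/pi fits in an int. */
float reduced_sin7(float x)
{
    const float pi = 3.14159265f;
    float q = x / pi;
    /* k = nearest integer to x/pi (round half away from zero) */
    int k = (int)(q + (q >= 0.0f ? 0.5f : -0.5f));
    float r = x - (float)k * pi;
    float s = taylor_sin7(r);
    /* sin(x) = sin(r + k*pi) = (-1)^k * sin(r) */
    return (k & 1) ? -s : s;
}
```

For large |x| the subtraction `x - k*pi` loses precision in single precision, so this sketch is only meant for arguments within a few multiples of pi.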
; ------------ 8< --------------
[ff6r: F$6.0]
[ff120r: F$120.0]
[ff5040r: F$5040.0]
[sinex: F$0.5]
; initialize fpu constants:
fld F$ff6r | fld1 | fdivrp
fld F$ff120r | fld1 | fdivrp
fld F$ff5040r | fld1 | fdivrp
fstp F$ff5040r | fstp F$ff120r | fstp F$ff6r
; formula:
; x-((1/6)*x^3)+((1/120)*x^5)-((1/5040)*x^7)
; assume a scalar single for simplicity:
movss xmm0 X$sinex | movss xmm1 X$sinex
mulss xmm1 xmm0 | mulss xmm1 xmm0 ; xmm1 = x^3
movss xmm2 xmm1 ; copy it to xmm2
mulss xmm1 X$ff6r
mulss xmm2 xmm0 | mulss xmm2 xmm0 ; xmm2 = x^5
movss xmm3 xmm2 ; copy x^5 to xmm3
mulss xmm2 X$ff120r ; xmm2 = (1/120)*x^5
mulss xmm3 xmm0 | subss xmm1 xmm2 ; xmm3 = x^6, xmm1 = ((1/6)*x^3)-((1/120)*x^5)
mulss xmm3 xmm0 ; xmm3 = x^7
mulss xmm3 X$ff5040r | addss xmm1 xmm3 ; xmm1 = ((1/6)*x^3)-((1/120)*x^5)+((1/5040)*x^7)
subss xmm0 xmm1 ; xmm0 = x - xmm1 = approximate sin(x)
dbgxmm
; ------------ 8< --------------
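For the packed case, the same polynomial can be evaluated for four singles at once. The following is a rough sketch in C using SSE intrinsics from `<xmmintrin.h>`, assuming an x86 target with SSE and inputs that are already in a small range such as [-pi/2, pi/2]. The names `sin7_ps` and `sin7_array4` are invented for this example. It uses Horner form in x^2, which needs fewer multiplies than computing each power separately.

```c
#include <xmmintrin.h> /* SSE intrinsics; assumes an x86 target with SSE */

/* Evaluate x - x^3/6 + x^5/120 - x^7/5040 for four packed singles.
   Hypothetical helper; no range reduction is done here. */
__m128 sin7_ps(__m128 x)
{
    const __m128 c3 = _mm_set1_ps(-1.0f / 6.0f);
    const __m128 c5 = _mm_set1_ps( 1.0f / 120.0f);
    const __m128 c7 = _mm_set1_ps(-1.0f / 5040.0f);

    __m128 x2 = _mm_mul_ps(x, x);
    /* Horner form in x^2: p = c3 + x2*(c5 + x2*c7) */
    __m128 p = _mm_add_ps(c5, _mm_mul_ps(x2, c7));
    p = _mm_add_ps(c3, _mm_mul_ps(x2, p));
    /* result = x + x^3 * p */
    return _mm_add_ps(x, _mm_mul_ps(_mm_mul_ps(x2, x), p));
}

/* Convenience wrapper: four sines from and to plain float arrays
   (unaligned loads and stores, so no alignment requirement). */
void sin7_array4(const float in[4], float out[4])
{
    _mm_storeu_ps(out, sin7_ps(_mm_loadu_ps(in)));
}
```

The reciprocal constants are folded in at compile time here, which replaces the FPU initialization step in the listing above. Whether this beats the clock estimate given earlier would have to be measured on the target CPU.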
Download AMD's math library for 3DNow!, see how they did it, and rewrite it for SSE. I assume that AMD knows what they are doing and that it is 100% accurate.