What is more expensive in float?

8 comments, last by JohnnyCode 7 years, 12 months ago

I wonder which of these two sets of operations would be faster on computers, in float or double numbers:

1- four goniometric functions

vs

2- one square root and one inverse cosine function

(I am assuming you mean trigonometric functions instead of goniometric functions)

Anyway, the correct answer is: Profile your code and test it yourself. My guess would be that the square root and one trigonometric function might be slightly faster, but I don't know which CPU you are using, so YMMV.
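For anyone who wants to do the profiling, here is a minimal timing sketch. It assumes the two candidates look roughly like the functions below; `four_trig`, `sqrt_acos` and `time_ns` are made-up names for illustration, since the actual formulas were never posted.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for option 1: four trigonometric calls.
inline float four_trig(float x)
{
    return std::sin(x) + std::cos(x) + std::sin(2.0f * x) + std::cos(2.0f * x);
}

// Hypothetical stand-in for option 2: one square root and one inverse cosine.
// Callers must pass x in [0, 1] so that acos stays inside its domain.
inline float sqrt_acos(float x)
{
    return std::acos(std::sqrt(x));
}

// Times n calls of f over inputs in [0, 1) and returns elapsed nanoseconds.
// The running sum is written to *sink so the optimiser cannot drop the loop.
template <typename F>
std::int64_t time_ns(F f, std::uint32_t n, float* sink)
{
    assert(sink != nullptr);
    const auto start = std::chrono::steady_clock::now();
    float acc = 0.0f;
    for (std::uint32_t i = 0; i < n; ++i)
    {
        acc += f(static_cast<float>(i) / static_cast<float>(n));
    }
    const auto stop = std::chrono::steady_clock::now();
    *sink = acc;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Run both with a few million iterations in an optimised build on the machine you care about. The numbers will move with compiler flags and CPU, so treat a single run as a hint rather than a verdict.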

<insert obligatory rant about premature optimization here, blablabla>

Thanks for the guess, though. It is a standard CPU op; whatever advanced math lib is involved, whether Intel or AMD native ops, I believe they should not differ on this?

I think otherwise: the inverse function is so expensive, while trigonometric functions can be approximated so well by Taylor polynomials of only a few degrees. I am going to profile, but I hope for more guesses :)

Thanks.
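To make the Taylor idea concrete, here is a rough sketch of a degree-7 Taylor polynomial for sine. The name `taylor_sin` and the choice of degree are assumptions for illustration only. On [-pi/2, pi/2] the truncation error is bounded by the next term, x^9/9!, which is about 1.6e-4 at the ends of that range.

```cpp
#include <cassert>
#include <cmath>

// Degree-7 Taylor polynomial of sin(x) about 0, evaluated in Horner form:
//   x - x^3/3! + x^5/5! - x^7/7!
// Only intended for x in [-pi/2, pi/2]; the error grows quickly outside it.
inline float taylor_sin(float x)
{
    assert(std::fabs(x) <= 1.5708f);
    const float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f / 6.0f + x2 * (1.0f / 120.0f + x2 * (-1.0f / 5040.0f))));
}
```

A Taylor series is most accurate near zero and worst at the edges of the interval, so a polynomial fitted for minimum error over the whole range usually does better at the same degree.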

Thanks for the guess, though. It is a standard CPU op; whatever advanced math lib is involved, whether Intel or AMD native ops, I believe they should not differ on this?
Sines/cosines are standard maths functions. It is possible to improve on the standard library implementations (depending on how much accuracy you are willing to sacrifice). So yeah, if you can find a way to reduce N cmath calls by one or more, then it's typically a good thing. I would be semi-inclined to make the switch from 4 funcs to 1 + sqrt without bothering to profile. It's highly unlikely that it will be slower (and the library code can change between platforms, so an improvement measured on one platform might not be better on another).

*IF* fast math is enabled in your compiler settings, then sqrt is a CPU instruction (if using strict or precise, then typically a standard library function will be used).
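As an illustration of trading trig calls for a sqrt (an assumed example, not the original poster's actual formula): when you already have the cosine of an angle in [0, pi], the sine follows from sin^2 + cos^2 = 1, so `std::sin(std::acos(c))` collapses to a single square root. `sin_from_cos` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Sine of an angle in [0, pi] given only its cosine c, using
// sin^2 + cos^2 = 1. Replaces std::sin(std::acos(c)) with one sqrt.
inline float sin_from_cos(float c)
{
    assert(c >= -1.0f && c <= 1.0f);
    // The max() guards against 1 - c*c going slightly negative from rounding.
    return std::sqrt(std::max(0.0f, 1.0f - c * c));
}
```

The result is always non-negative, which is only correct because the angle is assumed to lie in [0, pi]. For other quadrants the sign has to come from somewhere else.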

I think otherwise: the inverse function is so expensive, while trigonometric functions can be approximated so well by Taylor polynomials of only a few degrees. I am going to profile, but I hope for more guesses :)
Thanks.

There are some fairly decent arc-cos / arc-sin approximations around. Certainly the approximations I use aren't substantially worse than the non inverse functions.

Worth reading these:

http://forum.devmaster.net/t/fast-and-accurate-sine-cosine/9648
https://www.ecse.rpi.edu/~wrf/Research/Short_Notes/arcsin/onlyelem.html
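For a flavour of what such an approximation looks like, here is a sketch of one widely circulated arc-cosine approximation, using polynomial coefficients from Abramowitz and Stegun's Handbook of Mathematical Functions. It is not necessarily the one I use or the one in the links above, and `fast_acos` is just an illustrative name. Absolute error is a little under 1e-4 across [-1, 1].

```cpp
#include <cassert>
#include <cmath>

// Polynomial arc-cosine approximation:
//   acos(x) ~= sqrt(1 - x) * (a0 + a1*x + a2*x^2 + a3*x^3)   for 0 <= x <= 1
// Input must lie in [-1, 1].
inline float fast_acos(float x)
{
    assert(x >= -1.0f && x <= 1.0f);
    const float pi = 3.14159265358979f;
    const float ax = std::fabs(x);
    // Horner evaluation of a0 + a1*ax + a2*ax^2 + a3*ax^3.
    float p = -0.0187293f;
    p = p * ax + 0.0742610f;
    p = p * ax - 0.2121144f;
    p = p * ax + 1.5707288f;
    const float r = p * std::sqrt(1.0f - ax);
    // acos(-x) = pi - acos(x) extends the result to negative inputs.
    return x < 0.0f ? pi - r : r;
}
```

Note that it costs one sqrt plus a handful of multiply-adds, with no table lookups and only one sign check, so it is a reasonable candidate for vectorising.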

Generally speaking, using floats will be quicker than double, but YMMV. (Long topic, so I'll leave that can of worms shut for now)

What are you guys doing where trigonometric functions are used so heavily as to make a difference in performance?

Álvaro, on 18 Apr 2016 - 10:39 PM, said :
What are you guys doing where trigonometric functions are used so heavily as to make a difference in performance ?
This is awfully dismissive, but go on then, I'll bite....
Let's write a rubbish function in C++
[source]
__declspec(dllexport) void func1(float* out, const float* a, const float* b, const uint32_t n)
{
    for (uint32_t i = 0; i < n; ++i)
    {
        // Element-wise add of the two input arrays.
        out[i] = a[i] + b[i];
    }
}
[/source]
I'm exporting the func so that the compiler doesn't simply strip out the method. Any reasonable compiler will be able to happily replace that code with some tasty SIMD goodness, so let's quickly check using something such as Visual C++ 2015...
[source]
$LL4@func1:
vmovups ymm1, YMMWORD PTR[rdx + r10 * 4]
vaddps ymm1, ymm1, YMMWORD PTR[r8 + r10 * 4]
lea eax, DWORD PTR[r10 + 8]
vmovups YMMWORD PTR[r11 + r10 * 4], ymm1
vmovups ymm1, YMMWORD PTR[rdx + rax * 4]
vaddps ymm1, ymm1, YMMWORD PTR[r8 + rax * 4]
add r10d, 16
vmovups YMMWORD PTR[r11 + rax * 4], ymm1
cmp r10d, ebx
jb SHORT $LL4@func1
[/source]
Note: I've removed some detail from the asm here. VC++ unrolls the loop into 16 floats, and adds some extra code to handle the last elements (up to 15 - which I've removed). I'm only concerned with the innermost loop here!
So that loop boils down to this :
vaddps ymm1, ymm1, YMMWORD PTR[r8 + r10 * 4]
an AVX addition, which is using YMM registers (i.e. 8 floats at a time). Good.
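If the shape of that generated code is hard to picture, here is a scalar sketch of it, under the assumption that one block of 8 floats stands in for one YMM operation. `add_blocked` is a made-up name, and the real output additionally unrolls the main loop twice.

```cpp
#include <cassert>
#include <cstdint>

// Scalar sketch of the loop shape the compiler generated: a main loop that
// consumes 8 floats per step (one YMM register's worth) and a scalar tail
// for the remaining n % 8 elements.
void add_blocked(float* out, const float* a, const float* b, const std::uint32_t n)
{
    assert(n == 0u || (out != nullptr && a != nullptr && b != nullptr));
    const std::uint32_t blocked = n - (n % 8u);
    std::uint32_t i = 0;
    for (; i < blocked; i += 8u)
    {
        // Stands in for a single vaddps on 8 lanes.
        for (std::uint32_t lane = 0; lane < 8u; ++lane)
        {
            out[i + lane] = a[i + lane] + b[i + lane];
        }
    }
    // Tail: whatever did not fill a whole block.
    for (; i < n; ++i)
    {
        out[i] = a[i] + b[i];
    }
}
```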
Now let's make a minor change to that code:
[source]
__declspec(dllexport) void func2(float* out, const float* a, const float* b, const uint32_t n)
{
    for (uint32_t i = 0; i < n; ++i)
    {
        // Same loop as func1, but with a std::sin call on each sum.
        out[i] = std::sin(a[i] + b[i]);
    }
}
[/source]
So, let's take a look at what that spits out....
[source]
$LL4@func2:
vmovups ymm1, YMMWORD PTR[r14 + rsi * 4]
vaddps ymm0, ymm1, YMMWORD PTR[r12 + rsi * 4]
call __vdecl_sinf8
vmovups YMMWORD PTR[rdi + rsi * 4], ymm0
add esi, 8
cmp esi, r13d
jb SHORT $LL4@func2
[/source]
That's actually not *too* bad, it's inserted a nice SIMD function call here [__vdecl_sinf8] (Some older compilers would actually fail to SIMD this code at all, and end up doing 1 float at a time). The performance here though, depends on the implementation of __vdecl_sinf8, but we'll get to that later.
So let's look at the implementation I posted above:
[source]
inline float sine(const float x)
{
    // Parabolic approximation of sin(x); only valid for x in [-pi, pi].
    const float pi = 3.1415926535897932384626433832795f;
    const float B = 4.0f / pi;
    const float C = -4.0f / (pi * pi);
    float y = B * x + C * x * std::abs(x);
    // Blend in the squared parabola to reduce the error.
    const float P = 0.225f;
    y = P * (y * std::abs(y) - y) + y;
    return y;
}
__declspec(dllexport) void func3(float* out, const float* a, const float* b, const uint32_t n)
{
    for (uint32_t i = 0; i < n; ++i)
    {
        out[i] = sine(a[i] + b[i]);
    }
}
[/source]
A quick look at the inner loop of this approach, and we have:
[source]
$LL4@func3:
vmovups ymm1, YMMWORD PTR[rdx + r10 * 4]
vaddps ymm1, ymm1, YMMWORD PTR[r8 + r10 * 4]
vmovups ymm2, ymm1
vmulps ymm1, ymm6, ymm1
vmulps ymm4, ymm7, ymm2
vandps ymm0, ymm2, ymm5
vfnmadd231ps ymm1, ymm0, ymm4
vandps ymm2, ymm1, ymm5
vmovups ymm3, ymm1
vfmsub231ps ymm1, ymm2, ymm1
vfmadd231ps ymm3, ymm1, ymm8
vmovups YMMWORD PTR[r11 + r10 * 4], ymm3
lea eax, DWORD PTR[r10 + 8]
add r10d, 16
vmovups ymm1, YMMWORD PTR[rdx + rax * 4]
vaddps ymm1, ymm1, YMMWORD PTR[r8 + rax * 4]
vmovups ymm2, ymm1
vmulps ymm4, ymm2, ymm7
vandps ymm0, ymm2, ymm5
vmulps ymm1, ymm1, ymm6
vfnmadd231ps ymm1, ymm0, ymm4
vmovups ymm3, ymm1
vandps ymm2, ymm1, ymm5
vfmsub231ps ymm1, ymm2, ymm1
vfmadd231ps ymm3, ymm1, ymm8
vmovups YMMWORD PTR[r11 + rax * 4], ymm3
cmp r10d, ebx
jb SHORT $LL4@func3
[/source]
You'll notice it's unrolled the loop again, so the actual sine approximation boils down to:
[source]
vmovups ymm2, ymm1
vmulps ymm1, ymm6, ymm1
vmulps ymm4, ymm7, ymm2
vandps ymm0, ymm2, ymm5
vfnmadd231ps ymm1, ymm0, ymm4
vandps ymm2, ymm1, ymm5
vmovups ymm3, ymm1
vfmsub231ps ymm1, ymm2, ymm1
vfmadd231ps ymm3, ymm1, ymm8
vmovups YMMWORD PTR[r11 + r10 * 4], ymm3
[/source]
Which isn't too bad at all (some nice tasty FMA goodness in there). One obvious problem is that it doesn't like numbers outside the -Pi to +Pi range. That may end up adding an extra 5 or so ops to fmod the argument so it fits nicely into that range (not really a biggie with AVX).
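A rough sketch of that range fix, for the curious. It wraps with `std::floor` rather than `std::fmod`, on the assumption that a floor is the friendlier of the two to vectorise; `wrap_pi` and `sine_wrapped` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Wraps a finite x into roughly [-pi, pi] by subtracting the nearest
// multiple of 2*pi.
inline float wrap_pi(const float x)
{
    assert(std::isfinite(x));
    const float two_pi = 6.283185307179586f;
    return x - two_pi * std::floor(x / two_pi + 0.5f);
}

// The parabolic sine approximation, preceded by the range reduction.
inline float sine_wrapped(const float x)
{
    const float pi = 3.1415926535897932384626433832795f;
    const float B = 4.0f / pi;
    const float C = -4.0f / (pi * pi);
    const float P = 0.225f;
    const float r = wrap_pi(x);
    float y = B * r + C * r * std::abs(r);
    y = P * (y * std::abs(y) - y) + y;
    return y;
}
```

Float range reduction like this loses precision as |x| grows, so it suits angles within a few turns of zero rather than arbitrarily large arguments.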
Right then, back to __vdecl_sinf8. Stepping through into the method with a disassembler, I end up facing this:
[source]
00007FF7C8CF7E00 push rsi
00007FF7C8CF7E01 push r14
00007FF7C8CF7E03 push r15
00007FF7C8CF7E05 sub rsp,2C0h
00007FF7C8CF7E0C xor esi,esi
00007FF7C8CF7E0E vmovups ymmword ptr [rsp+2A0h],ymm15
00007FF7C8CF7E17 vmovups ymmword ptr [rsp+260h],ymm14
00007FF7C8CF7E20 vmovups ymmword ptr [rsp+220h],ymm13
00007FF7C8CF7E29 vmovups ymmword ptr [rsp+240h],ymm12
00007FF7C8CF7E32 vmovups ymmword ptr [rsp+1C0h],ymm11
00007FF7C8CF7E3B vmovups ymmword ptr [rsp+1E0h],ymm10
00007FF7C8CF7E44 vmovups ymmword ptr [rsp+1A0h],ymm9
00007FF7C8CF7E4D vmovups ymmword ptr [rsp+280h],ymm8
00007FF7C8CF7E56 vmovups ymmword ptr [rsp+180h],ymm7
00007FF7C8CF7E5F vmovups ymmword ptr [rsp+200h],ymm6
00007FF7C8CF7E68 vpxor xmm7,xmm7,xmm7
00007FF7C8CF7E6C mov qword ptr [rsp+0D8h],r13
00007FF7C8CF7E74 lea r13,[rsp+11Fh]
00007FF7C8CF7E7C vmovups ymm6,ymmword ptr [__common_ssin_data+1000h (07FF7C8D02C00h)]
00007FF7C8CF7E84 and r13,0FFFFFFFFFFFFFFC0h
00007FF7C8CF7E88 vandps ymm2,ymm0,ymm6
00007FF7C8CF7E8C vcmpgt_oqps ymm1,ymm2,ymmword ptr [__common_ssin_data+1040h (07FF7C8D02C40h)]
00007FF7C8CF7E95 vextractf128 xmm3,ymm1,1
00007FF7C8CF7E9B vpackssdw xmm4,xmm1,xmm3
00007FF7C8CF7E9F vpacksswb xmm5,xmm4,xmm7
00007FF7C8CF7EA3 vpmovmskb eax,xmm5
00007FF7C8CF7EA7 test al,al
00007FF7C8CF7EA9 jne __avx_sinf8+219h (07FF7C8CF8019h)
_B1_2:
00007FF7C8CF7EAF vmovups ymm8,ymmword ptr [__common_ssin_data+14C0h (07FF7C8D030C0h)]
00007FF7C8CF7EB7 vmulps ymm3,ymm2,ymmword ptr [__common_ssin_data+1480h (07FF7C8D03080h)]
00007FF7C8CF7EBF vaddps ymm4,ymm3,ymm8
00007FF7C8CF7EC4 vandnps ymm1,ymm6,ymm0
00007FF7C8CF7EC8 vpslld xmm5,xmm4,1Fh
00007FF7C8CF7ECD vsubps ymm13,ymm4,ymm8
00007FF7C8CF7ED2 vmulps ymm9,ymm13,ymmword ptr [__common_ssin_data+11C0h (07FF7C8D02DC0h)]
00007FF7C8CF7EDA vmulps ymm10,ymm13,ymmword ptr [__common_ssin_data+1200h (07FF7C8D02E00h)]
00007FF7C8CF7EE2 vmulps ymm12,ymm13,ymmword ptr [__common_ssin_data+1240h (07FF7C8D02E40h)]
00007FF7C8CF7EEA vmulps ymm15,ymm13,ymmword ptr [__common_ssin_data+1280h (07FF7C8D02E80h)]
00007FF7C8CF7EF2 vsubps ymm2,ymm2,ymm9
00007FF7C8CF7EF7 vsubps ymm11,ymm2,ymm10
00007FF7C8CF7EFC vsubps ymm14,ymm11,ymm12
00007FF7C8CF7F01 vsubps ymm2,ymm14,ymm15
00007FF7C8CF7F06 vmulps ymm10,ymm2,ymm2
00007FF7C8CF7F0A vextractf128 xmm6,ymm4,1
00007FF7C8CF7F10 vmulps ymm4,ymm10,ymmword ptr [__common_ssin_data+1440h (07FF7C8D03040h)]
00007FF7C8CF7F18 vpslld xmm7,xmm6,1Fh
00007FF7C8CF7F1D vinsertf128 ymm3,ymm5,xmm7,1
00007FF7C8CF7F23 vaddps ymm5,ymm4,ymmword ptr [__common_ssin_data+1400h (07FF7C8D03000h)]
00007FF7C8CF7F2B vmulps ymm6,ymm5,ymm10
00007FF7C8CF7F30 vaddps ymm7,ymm6,ymmword ptr [__common_ssin_data+13C0h (07FF7C8D02FC0h)]
00007FF7C8CF7F38 vmulps ymm8,ymm7,ymm10
00007FF7C8CF7F3D vaddps ymm9,ymm8,ymmword ptr [__common_ssin_data+1380h (07FF7C8D02F80h)]
00007FF7C8CF7F45 vmulps ymm11,ymm9,ymm10
00007FF7C8CF7F4A vxorps ymm13,ymm2,ymm3
00007FF7C8CF7F4E vmulps ymm12,ymm11,ymm13
00007FF7C8CF7F53 vaddps ymm14,ymm12,ymm13
00007FF7C8CF7F58 vxorps ymm1,ymm14,ymm1
_B1_3:
00007FF7C8CF7F5C test sil,sil
00007FF7C8CF7F5F jne __avx_sinf8+1D4h (07FF7C8CF7FD4h)
_B1_4:
00007FF7C8CF7F61 vmovups ymm6,ymmword ptr [rsp+200h]
00007FF7C8CF7F6A vmovups ymm7,ymmword ptr [rsp+180h]
00007FF7C8CF7F73 vmovups ymm8,ymmword ptr [rsp+280h]
00007FF7C8CF7F7C vmovups ymm9,ymmword ptr [rsp+1A0h]
00007FF7C8CF7F85 vmovups ymm10,ymmword ptr [rsp+1E0h]
00007FF7C8CF7F8E vmovups ymm11,ymmword ptr [rsp+1C0h]
00007FF7C8CF7F97 vmovups ymm12,ymmword ptr [rsp+240h]
00007FF7C8CF7FA0 vmovups ymm13,ymmword ptr [rsp+220h]
00007FF7C8CF7FA9 vmovups ymm14,ymmword ptr [rsp+260h]
00007FF7C8CF7FB2 vmovups ymm15,ymmword ptr [rsp+2A0h]
00007FF7C8CF7FBB mov r13,qword ptr [rsp+0D8h]
00007FF7C8CF7FC3 vmovaps ymm0,ymm1
00007FF7C8CF7FC7 add rsp,2C0h
00007FF7C8CF7FCE pop r15
00007FF7C8CF7FD0 pop r14
00007FF7C8CF7FD2 pop rsi
00007FF7C8CF7FD3 ret
_B1_5:
00007FF7C8CF7FD4 vmovups ymmword ptr [r13],ymm0
00007FF7C8CF7FDA vmovups ymmword ptr [r13+40h],ymm1
00007FF7C8CF7FE0 test esi,esi
00007FF7C8CF7FE2 je __avx_sinf8+161h (07FF7C8CF7F61h)
_B1_7:
00007FF7C8CF7FE8 xor r14d,r14d
_B1_8:
00007FF7C8CF7FEB bt esi,r14d
00007FF7C8CF7FEF jb __avx_sinf8+205h (07FF7C8CF8005h)
_B1_9:
00007FF7C8CF7FF1 inc r14d
00007FF7C8CF7FF4 cmp r14d,20h
00007FF7C8CF7FF8 jl __avx_sinf8+1EBh (07FF7C8CF7FEBh)
_B1_10:
00007FF7C8CF7FFA vmovups ymm1,ymmword ptr [r13+40h]
00007FF7C8CF8000 jmp __avx_sinf8+161h (07FF7C8CF7F61h)
_B1_11:
00007FF7C8CF8005 vzeroupper
00007FF7C8CF8008 lea rcx,[r13+r14*4]
00007FF7C8CF800D lea rdx,[r13+r14*4+40h]
00007FF7C8CF8012 call __common_ssin_cout_rare (07FF7C8CF88E0h)
00007FF7C8CF8017 jmp __avx_sinf8+1F1h (07FF7C8CF7FF1h)
_B1_12:
00007FF7C8CF8019 vmovups ymm10,ymmword ptr [__common_ssin_data+1080h (07FF7C8D02C80h)]
00007FF7C8CF8021 mov edx,7F800000h
00007FF7C8CF8026 vmovups ymmword ptr [r13],ymm0
00007FF7C8CF802C vmovd xmm8,edx
00007FF7C8CF8030 vpshufd xmm13,xmm8,0
00007FF7C8CF8036 vandps ymm6,ymm10,ymm2
00007FF7C8CF803A mov edx,0FFh
00007FF7C8CF803F vcmpeqps ymm1,ymm6,ymm10
00007FF7C8CF8045 lea rax,[__common_ssin_reduction_data (07FF7C8D01000h)]
00007FF7C8CF804C vpand xmm4,xmm13,xmm0
00007FF7C8CF8050 vextractf128 xmm15,ymm0,1
00007FF7C8CF8056 vpsrld xmm14,xmm4,17h
00007FF7C8CF805B vpand xmm9,xmm13,xmm15
00007FF7C8CF8060 vpslld xmm12,xmm14,1
00007FF7C8CF8066 vpsrld xmm11,xmm9,17h
00007FF7C8CF806C vpaddd xmm5,xmm12,xmm14
00007FF7C8CF8071 vpslld xmm6,xmm11,1
00007FF7C8CF8077 vpaddd xmm10,xmm6,xmm11
00007FF7C8CF807C vpslld xmm9,xmm10,2
00007FF7C8CF8082 vmovd r14d,xmm9
00007FF7C8CF8087 vmovups xmmword ptr [rsp+20h],xmm0
00007FF7C8CF808D vmovups xmmword ptr [rsp+30h],xmm15
00007FF7C8CF8093 vpextrd r15d,xmm9,1
00007FF7C8CF8099 vpextrd esi,xmm9,2
00007FF7C8CF809F vextractf128 xmm3,ymm1,1
00007FF7C8CF80A5 vpackssdw xmm2,xmm1,xmm3
00007FF7C8CF80A9 vpslld xmm1,xmm5,2
00007FF7C8CF80AE vpacksswb xmm7,xmm2,xmm7
00007FF7C8CF80B2 vpmovmskb ecx,xmm7
00007FF7C8CF80B6 vmovd r8d,xmm1
00007FF7C8CF80BB vmovd xmm12,dword ptr [r14+rax]
00007FF7C8CF80C1 vmovd xmm14,dword ptr [r15+rax]
00007FF7C8CF80C7 mov dword ptr [rsp+0D0h],ecx
00007FF7C8CF80CE vpextrd ecx,xmm9,3
00007FF7C8CF80D4 vpextrd r10d,xmm1,2
00007FF7C8CF80DA vpextrd r11d,xmm1,3
00007FF7C8CF80E0 vpextrd r9d,xmm1,1
00007FF7C8CF80E6 vmovd xmm5,dword ptr [rsi+rax]
00007FF7C8CF80EB vmovd xmm6,dword ptr [rcx+rax]
00007FF7C8CF80F0 vmovd xmm7,dword ptr [r10+rax]
00007FF7C8CF80F6 vmovd xmm8,dword ptr [r11+rax]
00007FF7C8CF80FC vpunpcklqdq xmm11,xmm12,xmm14
00007FF7C8CF8101 vpunpcklqdq xmm10,xmm5,xmm6
00007FF7C8CF8105 vmovd xmm5,dword ptr [rsi+rax+4]
00007FF7C8CF810B vmovd xmm6,dword ptr [rcx+rax+4]
00007FF7C8CF8111 vpunpcklqdq xmm13,xmm7,xmm8
00007FF7C8CF8116 vshufps xmm8,xmm11,xmm10,88h
00007FF7C8CF811C vmovd xmm12,dword ptr [r14+rax+4]
00007FF7C8CF8123 vmovd xmm14,dword ptr [r15+rax+4]
00007FF7C8CF812A vpunpcklqdq xmm10,xmm5,xmm6
00007FF7C8CF812E vmovd xmm5,dword ptr [r15+rax+8]
00007FF7C8CF8135 mov r15d,7FFFFFh
00007FF7C8CF813B vmovd xmm3,dword ptr [r8+rax]
00007FF7C8CF8141 vmovd xmm2,dword ptr [r9+rax]
00007FF7C8CF8147 vpunpcklqdq xmm11,xmm12,xmm14
00007FF7C8CF814C vmovd xmm14,dword ptr [r14+rax+8]
00007FF7C8CF8153 mov r14d,800000h
00007FF7C8CF8159 vpunpcklqdq xmm4,xmm3,xmm2
00007FF7C8CF815D vmovd xmm1,dword ptr [r8+rax+4]
00007FF7C8CF8164 vmovd xmm3,dword ptr [r9+rax+4]
00007FF7C8CF816B vmovd xmm2,dword ptr [r10+rax+4]
00007FF7C8CF8172 vmovd xmm7,dword ptr [r11+rax+4]
00007FF7C8CF8179 vshufps xmm13,xmm4,xmm13,88h
00007FF7C8CF817F vpunpcklqdq xmm4,xmm1,xmm3
00007FF7C8CF8183 vpunpcklqdq xmm9,xmm2,xmm7
00007FF7C8CF8187 vmovd xmm7,dword ptr [r11+rax+8]
00007FF7C8CF818E mov r11d,0FFFFh
00007FF7C8CF8194 vmovd xmm1,dword ptr [r8+rax+8]
00007FF7C8CF819B mov r8d,47400000h
00007FF7C8CF81A1 vmovd xmm3,dword ptr [r9+rax+8]
00007FF7C8CF81A8 mov r9d,3F800000h
00007FF7C8CF81AE vmovd xmm2,dword ptr [r10+rax+8]
00007FF7C8CF81B5 mov r10d,80000000h
00007FF7C8CF81BB vshufps xmm4,xmm4,xmm9,88h
00007FF7C8CF81C1 vpunpcklqdq xmm9,xmm1,xmm3
00007FF7C8CF81C5 vpunpcklqdq xmm12,xmm2,xmm7
00007FF7C8CF81C9 vmovd xmm7,r15d
00007FF7C8CF81CE vshufps xmm1,xmm9,xmm12,88h
00007FF7C8CF81D4 vmovd xmm9,r14d
00007FF7C8CF81D9 vpshufd xmm12,xmm7,0
00007FF7C8CF81DE mov r15d,28800000h
00007FF7C8CF81E4 vpunpcklqdq xmm3,xmm14,xmm5
00007FF7C8CF81E8 vpand xmm0,xmm12,xmm0
00007FF7C8CF81EC vpshufd xmm5,xmm9,0
00007FF7C8CF81F2 vpand xmm15,xmm12,xmm15
00007FF7C8CF81F7 vshufps xmm10,xmm11,xmm10,88h
00007FF7C8CF81FD vpaddd xmm14,xmm0,xmm5
00007FF7C8CF8201 vmovd xmm6,dword ptr [rsi+rax+8]
00007FF7C8CF8207 vmovd xmm0,r11d
00007FF7C8CF820C vmovd xmm11,dword ptr [rcx+rax+8]
00007FF7C8CF8212 lea rax,[__common_ssin_data (07FF7C8D01C00h)]
00007FF7C8CF8219 mov r14d,3FFFFh
00007FF7C8CF821F vpunpcklqdq xmm2,xmm6,xmm11
00007FF7C8CF8224 vpaddd xmm6,xmm15,xmm5
00007FF7C8CF8228 vpshufd xmm15,xmm0,0
00007FF7C8CF822D vpsrld xmm11,xmm4,10h
00007FF7C8CF8232 vpand xmm7,xmm13,xmm15
00007FF7C8CF8237 vpand xmm9,xmm14,xmm15
00007FF7C8CF823C vmovups xmmword ptr [rsp+50h],xmm8
00007FF7C8CF8242 vpand xmm12,xmm8,xmm15
00007FF7C8CF8247 vshufps xmm3,xmm3,xmm2,88h
00007FF7C8CF824C vpsrld xmm8,xmm10,10h
00007FF7C8CF8252 vpand xmm0,xmm10,xmm15
00007FF7C8CF8257 vpsrld xmm2,xmm1,10h
00007FF7C8CF825C vpsrld xmm10,xmm14,10h
00007FF7C8CF8262 vpand xmm14,xmm6,xmm15
00007FF7C8CF8267 vmovdqu xmmword ptr [rsp+60h],xmm7
00007FF7C8CF826D vpand xmm5,xmm4,xmm15
00007FF7C8CF8272 vpmulld xmm7,xmm9,xmm7
00007FF7C8CF8277 vpand xmm1,xmm1,xmm15
00007FF7C8CF827C vmovdqu xmmword ptr [rsp+0B0h],xmm7
00007FF7C8CF8285 vpsrld xmm4,xmm3,10h
00007FF7C8CF828A vpmulld xmm7,xmm10,xmm2
00007FF7C8CF828F vpand xmm3,xmm3,xmm15
00007FF7C8CF8294 vpmulld xmm2,xmm9,xmm2
00007FF7C8CF8299 mov r11d,34000000h
00007FF7C8CF829F vmovups xmmword ptr [rsp+40h],xmm13
00007FF7C8CF82A5 vpsrld xmm13,xmm6,10h
00007FF7C8CF82AA vpmulld xmm6,xmm14,xmm12
00007FF7C8CF82AF vpsrld xmm2,xmm2,10h
00007FF7C8CF82B4 vmovdqu xmmword ptr [rsp+70h],xmm12
00007FF7C8CF82BA vpaddd xmm7,xmm7,xmm2
00007FF7C8CF82BE vmovdqu xmmword ptr [rsp+90h],xmm8
00007FF7C8CF82C7 mov ecx,0B795777Ah
00007FF7C8CF82CC vmovdqu xmmword ptr [rsp+0A0h],xmm0
00007FF7C8CF82D5 mov esi,7FFFFFFFh
00007FF7C8CF82DA vmovdqu xmmword ptr [rsp+0C0h],xmm6
00007FF7C8CF82E3 vpmulld xmm12,xmm14,xmm8
00007FF7C8CF82E8 vpmulld xmm6,xmm9,xmm5
00007FF7C8CF82ED vpmulld xmm8,xmm14,xmm0
00007FF7C8CF82F2 vpmulld xmm0,xmm10,xmm1
00007FF7C8CF82F7 vpand xmm2,xmm8,xmm15
00007FF7C8CF82FC vpsrld xmm1,xmm0,10h
00007FF7C8CF8301 vpand xmm0,xmm6,xmm15
00007FF7C8CF8306 vpaddd xmm0,xmm0,xmm7
00007FF7C8CF830A vpsrld xmm6,xmm6,10h
00007FF7C8CF830F vpaddd xmm7,xmm1,xmm0
00007FF7C8CF8313 vpsrld xmm8,xmm8,10h
00007FF7C8CF8319 vpmulld xmm0,xmm13,xmm3
00007FF7C8CF831E vpmulld xmm3,xmm13,xmm4
00007FF7C8CF8323 vpsrld xmm0,xmm0,10h
00007FF7C8CF8328 vpmulld xmm4,xmm14,xmm4
00007FF7C8CF832D vpsrld xmm1,xmm4,10h
00007FF7C8CF8332 vpaddd xmm3,xmm3,xmm1
00007FF7C8CF8336 vpsrld xmm1,xmm7,10h
00007FF7C8CF833B vmovdqu xmmword ptr [rsp+80h],xmm11
00007FF7C8CF8344 vpaddd xmm2,xmm2,xmm3
00007FF7C8CF8348 vpmulld xmm11,xmm9,xmm11
00007FF7C8CF834D vpaddd xmm4,xmm0,xmm2
00007FF7C8CF8351 vpmulld xmm5,xmm10,xmm5
00007FF7C8CF8356 vpand xmm0,xmm11,xmm15
00007FF7C8CF835B vpaddd xmm3,xmm5,xmm6
00007FF7C8CF835F vpand xmm2,xmm12,xmm15
00007FF7C8CF8364 vpaddd xmm0,xmm0,xmm3
00007FF7C8CF8368 vpsrld xmm3,xmm4,10h
00007FF7C8CF836D vpaddd xmm6,xmm1,xmm0
00007FF7C8CF8371 vpsrld xmm11,xmm11,10h
00007FF7C8CF8377 vpmulld xmm1,xmm13,xmmword ptr [rsp+0A0h]
00007FF7C8CF8381 vpsrld xmm5,xmm6,10h
00007FF7C8CF8386 vpaddd xmm0,xmm1,xmm8
00007FF7C8CF838B vpsrld xmm12,xmm12,10h
00007FF7C8CF8391 vpaddd xmm1,xmm2,xmm0
00007FF7C8CF8395 vpand xmm7,xmm7,xmm15
00007FF7C8CF839A vpmulld xmm2,xmm10,xmmword ptr [rsp+80h]
00007FF7C8CF83A4 vpaddd xmm8,xmm3,xmm1
00007FF7C8CF83A8 vmovdqu xmm3,xmmword ptr [rsp+0B0h]
00007FF7C8CF83B1 vpaddd xmm0,xmm2,xmm11
00007FF7C8CF83B6 vpand xmm1,xmm3,xmm15
00007FF7C8CF83BB vpsrld xmm2,xmm8,10h
00007FF7C8CF83C1 vpaddd xmm1,xmm1,xmm0
00007FF7C8CF83C5 vpslld xmm8,xmm8,10h
00007FF7C8CF83CB vpmulld xmm11,xmm13,xmmword ptr [rsp+90h]
00007FF7C8CF83D5 vpaddd xmm5,xmm5,xmm1
00007FF7C8CF83D9 vmovdqu xmm1,xmmword ptr [rsp+0C0h]
00007FF7C8CF83E2 vpaddd xmm11,xmm11,xmm12
00007FF7C8CF83E7 vpand xmm0,xmm1,xmm15
00007FF7C8CF83EC vpsrld xmm12,xmm5,10h
00007FF7C8CF83F1 vpaddd xmm0,xmm0,xmm11
00007FF7C8CF83F6 vpsrld xmm1,xmm1,10h
00007FF7C8CF83FB vmovups xmm11,xmmword ptr [rsp+40h]
00007FF7C8CF8401 vpaddd xmm2,xmm2,xmm0
00007FF7C8CF8405 vpsrld xmm0,xmm11,10h
00007FF7C8CF840B vpand xmm5,xmm5,xmm15
00007FF7C8CF8410 vpmulld xmm10,xmm10,xmmword ptr [rsp+60h]
00007FF7C8CF8417 vmovd xmm11,r9d
00007FF7C8CF841C vpmulld xmm9,xmm9,xmm0
00007FF7C8CF8421 vpsrld xmm0,xmm3,10h
00007FF7C8CF8426 vpand xmm9,xmm9,xmm15
00007FF7C8CF842B vpaddd xmm10,xmm10,xmm0
00007FF7C8CF842F vpaddd xmm3,xmm9,xmm10
00007FF7C8CF8434 vpsrld xmm0,xmm2,10h
00007FF7C8CF8439 vpaddd xmm9,xmm12,xmm3
00007FF7C8CF843D vpand xmm2,xmm2,xmm15
00007FF7C8CF8442 vmovups xmm3,xmmword ptr [rsp+50h]
00007FF7C8CF8448 vpslld xmm12,xmm9,10h
00007FF7C8CF844E vpsrld xmm9,xmm3,10h
00007FF7C8CF8453 vpaddd xmm10,xmm12,xmm5
00007FF7C8CF8457 vpmulld xmm13,xmm13,xmmword ptr [rsp+70h]
00007FF7C8CF845E mov r9d,40C90FDBh
00007FF7C8CF8464 vpmulld xmm14,xmm14,xmm9
00007FF7C8CF8469 vpaddd xmm13,xmm13,xmm1
00007FF7C8CF846D vpand xmm3,xmm14,xmm15
00007FF7C8CF8472 vpand xmm15,xmm4,xmm15
00007FF7C8CF8477 vpaddd xmm9,xmm3,xmm13
00007FF7C8CF847C vmovd xmm4,r10d
00007FF7C8CF8481 vpaddd xmm0,xmm0,xmm9
00007FF7C8CF8486 vpslld xmm14,xmm6,10h
00007FF7C8CF848B vpshufd xmm6,xmm4,0
00007FF7C8CF8490 vpslld xmm12,xmm0,10h
00007FF7C8CF8495 vpand xmm5,xmm6,xmmword ptr [rsp+20h]
00007FF7C8CF849B vpaddd xmm9,xmm14,xmm7
00007FF7C8CF849F vpshufd xmm7,xmm11,0
00007FF7C8CF84A5 vpaddd xmm0,xmm12,xmm2
00007FF7C8CF84A9 vpand xmm14,xmm6,xmmword ptr [rsp+30h]
00007FF7C8CF84AF vpsrld xmm1,xmm10,9
00007FF7C8CF84B5 vpxor xmm3,xmm5,xmm7
00007FF7C8CF84B9 vmovd xmm12,r8d
00007FF7C8CF84BE vpaddd xmm15,xmm8,xmm15
00007FF7C8CF84C3 vpsrld xmm8,xmm0,9
00007FF7C8CF84C8 vpxor xmm4,xmm14,xmm7
00007FF7C8CF84CC vpor xmm2,xmm1,xmm3
00007FF7C8CF84D0 vpshufd xmm6,xmm12,0
00007FF7C8CF84D6 vpor xmm13,xmm8,xmm4
00007FF7C8CF84DA vmovd xmm12,r15d
00007FF7C8CF84DF mov r10d,1FFh
00007FF7C8CF84E5 mov r8d,40C91000h
00007FF7C8CF84EB mov r15d,35800000h
00007FF7C8CF84F1 vinsertf128 ymm1,ymm2,xmm13,1
00007FF7C8CF84F7 vmovd xmm2,edx
00007FF7C8CF84FB vinsertf128 ymm11,ymm6,xmm6,1
00007FF7C8CF8501 mov edx,0FFFFF000h
00007FF7C8CF8506 vaddps ymm4,ymm11,ymm1
00007FF7C8CF850A vpshufd xmm8,xmm2,0
00007FF7C8CF850F vsubps ymm3,ymm4,ymm11
00007FF7C8CF8514 vsubps ymm13,ymm1,ymm3
00007FF7C8CF8518 vpshufd xmm1,xmm12,0
00007FF7C8CF851E vmovd xmm12,r14d
00007FF7C8CF8523 vpxor xmm2,xmm5,xmm1
00007FF7C8CF8527 vpxor xmm3,xmm14,xmm1
00007FF7C8CF852B vpshufd xmm1,xmm12,0
00007FF7C8CF8531 vpand xmm6,xmm1,xmm9
00007FF7C8CF8536 vpand xmm1,xmm1,xmm15
00007FF7C8CF853B vpslld xmm11,xmm6,5
00007FF7C8CF8540 vpslld xmm12,xmm1,5
00007FF7C8CF8545 vpor xmm11,xmm11,xmm2
00007FF7C8CF8549 vpor xmm6,xmm12,xmm3
00007FF7C8CF854D vpsrld xmm9,xmm9,12h
00007FF7C8CF8553 vpsrld xmm15,xmm15,12h
00007FF7C8CF8559 vinsertf128 ymm3,ymm2,xmm3,1
00007FF7C8CF855F vmovd xmm2,r10d
00007FF7C8CF8564 vinsertf128 ymm1,ymm11,xmm6,1
00007FF7C8CF856A vmovd xmm11,ecx
00007FF7C8CF856E vpshufd xmm6,xmm2,0
00007FF7C8CF8573 vmovd xmm2,r9d
00007FF7C8CF8578 vsubps ymm12,ymm1,ymm3
00007FF7C8CF857C vmovd xmm1,r11d
00007FF7C8CF8581 vpand xmm10,xmm6,xmm10
00007FF7C8CF8586 vpand xmm0,xmm6,xmm0
00007FF7C8CF858A vpshufd xmm3,xmm1,0
00007FF7C8CF858F vpslld xmm10,xmm10,0Eh
00007FF7C8CF8595 vpxor xmm5,xmm5,xmm3
00007FF7C8CF8599 vpor xmm10,xmm10,xmm9
00007FF7C8CF859E vpor xmm1,xmm10,xmm5
00007FF7C8CF85A2 vpslld xmm10,xmm0,0Eh
00007FF7C8CF85A7 vpxor xmm14,xmm14,xmm3
00007FF7C8CF85AB vpor xmm10,xmm10,xmm15
00007FF7C8CF85B0 vpor xmm0,xmm10,xmm14
00007FF7C8CF85B5 vinsertf128 ymm3,ymm1,xmm0,1
00007FF7C8CF85BB vinsertf128 ymm14,ymm5,xmm14,1
00007FF7C8CF85C1 vmovd xmm5,r8d
00007FF7C8CF85C6 vsubps ymm10,ymm3,ymm14
00007FF7C8CF85CB vpshufd xmm6,xmm5,0
00007FF7C8CF85D0 vaddps ymm9,ymm13,ymm10
00007FF7C8CF85D5 vsubps ymm0,ymm13,ymm9
00007FF7C8CF85DA vpshufd xmm13,xmm2,0
00007FF7C8CF85DF vaddps ymm1,ymm10,ymm0
00007FF7C8CF85E3 vmovd xmm10,edx
00007FF7C8CF85E7 vpshufd xmm0,xmm10,0
00007FF7C8CF85ED vaddps ymm15,ymm1,ymm12
00007FF7C8CF85F2 vpshufd xmm12,xmm11,0
00007FF7C8CF85F8 vinsertf128 ymm1,ymm0,xmm0,1
00007FF7C8CF85FE vandps ymm5,ymm9,ymm1
00007FF7C8CF8602 vsubps ymm9,ymm9,ymm5
00007FF7C8CF8606 vinsertf128 ymm13,ymm13,xmm13,1
00007FF7C8CF860C vinsertf128 ymm2,ymm6,xmm6,1
00007FF7C8CF8612 vmovd xmm6,r15d
00007FF7C8CF8617 vinsertf128 ymm3,ymm12,xmm12,1
00007FF7C8CF861D vmulps ymm10,ymm2,ymm9
00007FF7C8CF8622 vmulps ymm1,ymm3,ymm5
00007FF7C8CF8626 vmulps ymm14,ymm13,ymm15
00007FF7C8CF862B vmulps ymm3,ymm3,ymm9
00007FF7C8CF8630 vmulps ymm0,ymm2,ymm5
00007FF7C8CF8634 vmovd xmm2,esi
00007FF7C8CF8638 vpshufd xmm5,xmm2,0
00007FF7C8CF863D vpshufd xmm9,xmm6,0
00007FF7C8CF8642 vaddps ymm15,ymm10,ymm1
00007FF7C8CF8646 vaddps ymm10,ymm14,ymm3
00007FF7C8CF864A vaddps ymm3,ymm15,ymm10
00007FF7C8CF864F vaddps ymm1,ymm0,ymm3
00007FF7C8CF8653 vsubps ymm0,ymm0,ymm1
00007FF7C8CF8657 vaddps ymm10,ymm0,ymm3
00007FF7C8CF865B vmovups ymm0,ymmword ptr [r13]
00007FF7C8CF8661 mov esi,dword ptr [rsp+0D0h]
00007FF7C8CF8668 vextractf128 xmm7,ymm4,1
00007FF7C8CF866E vpand xmm4,xmm4,xmm8
00007FF7C8CF8673 vpslld xmm2,xmm4,4
00007FF7C8CF8678 vpand xmm7,xmm7,xmm8
00007FF7C8CF867D vmovd r15d,xmm2
00007FF7C8CF8682 vpextrd r14d,xmm2,1
00007FF7C8CF8688 vpextrd r11d,xmm2,2
00007FF7C8CF868E vpextrd r10d,xmm2,3
00007FF7C8CF8694 vmovd xmm8,dword ptr [r15+rax]
00007FF7C8CF869A vmovd xmm2,dword ptr [r14+rax]
00007FF7C8CF86A0 vinsertf128 ymm13,ymm9,xmm9,1
00007FF7C8CF86A6 vpslld xmm9,xmm7,4
00007FF7C8CF86AB vmovd r9d,xmm9
00007FF7C8CF86B0 vmovd xmm4,dword ptr [r10+rax]
00007FF7C8CF86B6 vpextrd r8d,xmm9,1
00007FF7C8CF86BC vpextrd ecx,xmm9,2
00007FF7C8CF86C2 vpextrd edx,xmm9,3
00007FF7C8CF86C8 vmovd xmm9,dword ptr [r10+rax+4]
00007FF7C8CF86CF vinsertf128 ymm11,ymm5,xmm5,1
00007FF7C8CF86D5 vandps ymm12,ymm0,ymm11
00007FF7C8CF86DA vcmpgt_oqps ymm3,ymm12,ymm13
00007FF7C8CF86E0 vcmple_oqps ymm14,ymm12,ymm13
00007FF7C8CF86E6 vpunpcklqdq xmm5,xmm8,xmm2
00007FF7C8CF86EA vmovd xmm8,dword ptr [r11+rax]
00007FF7C8CF86F0 vpunpcklqdq xmm6,xmm8,xmm4
00007FF7C8CF86F4 vmovd xmm8,dword ptr [r11+rax+4]
00007FF7C8CF86FB vandps ymm15,ymm14,ymm0
00007FF7C8CF86FF vandps ymm1,ymm3,ymm1
00007FF7C8CF8703 vshufps xmm7,xmm5,xmm6,88h
00007FF7C8CF8708 vmovd xmm11,dword ptr [r9+rax]
00007FF7C8CF870E vmovd xmm12,dword ptr [r8+rax]
00007FF7C8CF8714 vmovd xmm13,dword ptr [rcx+rax]
00007FF7C8CF8719 vmovd xmm14,dword ptr [rdx+rax]
00007FF7C8CF871E vmovd xmm5,dword ptr [r15+rax+4]
00007FF7C8CF8725 vmovd xmm6,dword ptr [r14+rax+4]
00007FF7C8CF872C vorps ymm1,ymm15,ymm1
00007FF7C8CF8730 vpunpcklqdq xmm15,xmm11,xmm12
00007FF7C8CF8735 vpunpcklqdq xmm2,xmm13,xmm14
00007FF7C8CF873A vpunpcklqdq xmm11,xmm5,xmm6
00007FF7C8CF873E vpunpcklqdq xmm12,xmm8,xmm9
00007FF7C8CF8743 vshufps xmm4,xmm15,xmm2,88h
00007FF7C8CF8748 vshufps xmm13,xmm11,xmm12,88h
00007FF7C8CF874E vandps ymm10,ymm3,ymm10
00007FF7C8CF8753 vmulps ymm3,ymm1,ymm1
00007FF7C8CF8757 vinsertf128 ymm2,ymm7,xmm4,1
00007FF7C8CF875D vmovd xmm4,dword ptr [r9+rax+4]
_B1_15:
00007FF7C8CF8764 lea rax,[__common_ssin_reduction_data (07FF7C8D01000h)]
00007FF7C8CF876B vmovd xmm11,dword ptr [r8+rax+4]
00007FF7C8CF8772 vmovd xmm12,dword ptr [rcx+rax+4]
00007FF7C8CF8778 vmovd xmm14,dword ptr [rdx+rax+4]
00007FF7C8CF877E vpunpcklqdq xmm15,xmm4,xmm11
00007FF7C8CF8783 vpunpcklqdq xmm8,xmm12,xmm14
00007FF7C8CF8788 vshufps xmm8,xmm15,xmm8,88h
00007FF7C8CF878E vinsertf128 ymm9,ymm13,xmm8,1
00007FF7C8CF8794 vmovd xmm13,dword ptr [r15+rax+0Ch]
00007FF7C8CF879B vmovd xmm8,dword ptr [r14+rax+0Ch]
00007FF7C8CF87A2 vmovd xmm6,dword ptr [r11+rax+0Ch]
00007FF7C8CF87A9 vmovd xmm5,dword ptr [r10+rax+0Ch]
00007FF7C8CF87B0 vpunpcklqdq xmm4,xmm13,xmm8
00007FF7C8CF87B5 vmovd xmm12,dword ptr [r9+rax+0Ch]
00007FF7C8CF87BC vmovd xmm13,dword ptr [r8+rax+0Ch]
00007FF7C8CF87C3 vmovd xmm14,dword ptr [rcx+rax+0Ch]
00007FF7C8CF87C9 vmovd xmm15,dword ptr [rdx+rax+0Ch]
00007FF7C8CF87CF vpunpcklqdq xmm7,xmm6,xmm5
00007FF7C8CF87D3 vpunpcklqdq xmm8,xmm12,xmm13
00007FF7C8CF87D8 vpunpcklqdq xmm6,xmm14,xmm15
00007FF7C8CF87DD vshufps xmm11,xmm4,xmm7,88h
00007FF7C8CF87E2 vshufps xmm5,xmm8,xmm6,88h
00007FF7C8CF87E7 vmulps ymm15,ymm2,ymm1
00007FF7C8CF87EB vinsertf128 ymm7,ymm11,xmm5,1
00007FF7C8CF87F1 vmulps ymm12,ymm1,ymm7
00007FF7C8CF87F5 vaddps ymm13,ymm9,ymm12
00007FF7C8CF87FA vsubps ymm4,ymm9,ymm13
00007FF7C8CF87FF vaddps ymm8,ymm15,ymm13
00007FF7C8CF8804 vaddps ymm5,ymm4,ymm12
00007FF7C8CF8809 vsubps ymm14,ymm13,ymm8
00007FF7C8CF880E vmovd xmm13,dword ptr [r10+rax+8]
00007FF7C8CF8815 vmulps ymm4,ymm3,ymmword ptr [__common_ssin_data+1100h (07FF7C8D02D00h)]
00007FF7C8CF881D vaddps ymm6,ymm14,ymm15
00007FF7C8CF8822 vaddps ymm11,ymm4,ymmword ptr [__common_ssin_data+10C0h (07FF7C8D02CC0h)]
00007FF7C8CF882A vaddps ymm4,ymm2,ymm7
00007FF7C8CF882E vaddps ymm6,ymm6,ymm5
00007FF7C8CF8832 vmovd xmm7,dword ptr [r15+rax+8]
00007FF7C8CF8839 vmulps ymm12,ymm11,ymm3
00007FF7C8CF883D vmulps ymm2,ymm3,ymmword ptr [__common_ssin_data+1180h (07FF7C8D02D80h)]
00007FF7C8CF8845 vmovd xmm11,dword ptr [r14+rax+8]
00007FF7C8CF884C vpunpcklqdq xmm14,xmm7,xmm11
00007FF7C8CF8851 vmovd xmm7,dword ptr [r9+rax+8]
00007FF7C8CF8858 vmovd xmm11,dword ptr [r8+rax+8]
00007FF7C8CF885F vmulps ymm5,ymm12,ymm1
00007FF7C8CF8863 vmulps ymm1,ymm1,ymm9
00007FF7C8CF8868 vmovd xmm12,dword ptr [r11+rax+8]
00007FF7C8CF886F vpunpcklqdq xmm15,xmm12,xmm13
00007FF7C8CF8874 vaddps ymm2,ymm2,ymmword ptr [__common_ssin_data+1140h (07FF7C8D02D40h)]
00007FF7C8CF887C vpunpcklqdq xmm13,xmm7,xmm11
00007FF7C8CF8881 vmovd xmm7,dword ptr [rcx+rax+8]
00007FF7C8CF8887 vsubps ymm1,ymm4,ymm1
00007FF7C8CF888B vmovd xmm12,dword ptr [rdx+rax+8]
00007FF7C8CF8891 vmulps ymm3,ymm2,ymm3
00007FF7C8CF8895 vmulps ymm10,ymm10,ymm1
00007FF7C8CF8899 vshufps xmm2,xmm14,xmm15,88h
00007FF7C8CF889F vpunpcklqdq xmm14,xmm7,xmm12
00007FF7C8CF88A4 vshufps xmm15,xmm13,xmm14,88h
00007FF7C8CF88AA vmulps ymm3,ymm3,ymm9
00007FF7C8CF88AF vmulps ymm5,ymm5,ymm1
00007FF7C8CF88B3 vaddps ymm6,ymm5,ymm6
00007FF7C8CF88B7 vinsertf128 ymm2,ymm2,xmm15,1
00007FF7C8CF88BD vaddps ymm10,ymm10,ymm2
00007FF7C8CF88C1 vaddps ymm4,ymm3,ymm10
00007FF7C8CF88C6 vaddps ymm7,ymm4,ymm6
00007FF7C8CF88CA vaddps ymm1,ymm8,ymm7
00007FF7C8CF88CE jmp __avx_sinf8+15Ch (07FF7C8CF7F5Ch)
00007FF7C8CF88D3 nop word ptr [rax+rax]
_B1_1:
00007FF7C8CF88E0 sub rsp,28h
00007FF7C8CF88E4 mov r8d,dword ptr [rcx]
00007FF7C8CF88E7 movzx eax,word ptr [rcx+2]
00007FF7C8CF88EB mov dword ptr [rsp+20h],r8d
00007FF7C8CF88F0 and eax,7F80h
00007FF7C8CF88F5 shr r8d,18h
00007FF7C8CF88F9 and r8d,7Fh
00007FF7C8CF88FD movss xmm1,dword ptr [rcx]
00007FF7C8CF8901 cmp eax,7F80h
00007FF7C8CF8906 jne __common_ssin_cout_rare+5Ch (07FF7C8CF893Ch)
_B1_2:
00007FF7C8CF8908 mov byte ptr [rsp+23h],r8b
00007FF7C8CF890D cmp dword ptr [rsp+20h],7F800000h
00007FF7C8CF8915 jne __common_ssin_cout_rare+4Dh (07FF7C8CF892Dh)
_B1_3:
00007FF7C8CF8917 mov eax,1
00007FF7C8CF891C pxor xmm0,xmm0
00007FF7C8CF8920 mulss xmm1,xmm0
00007FF7C8CF8924 movss dword ptr [rdx],xmm1
00007FF7C8CF8928 add rsp,28h
00007FF7C8CF892C ret
_B1_4:
00007FF7C8CF892D mulss xmm1,xmm1
00007FF7C8CF8931 xor eax,eax
00007FF7C8CF8933 movss dword ptr [rdx],xmm1
_B1_5:
00007FF7C8CF8937 add rsp,28h
00007FF7C8CF893B ret
_B1_6:
00007FF7C8CF893C xor eax,eax
00007FF7C8CF893E add rsp,28h
00007FF7C8CF8942 ret
[/source]
Now, I'm not for one minute saying this code is in any way bad. It's accurate, it's relatively well optimised, and it handles all manner of errors I'm unlikely to ever encounter (e.g. the __common_ssin_cout_rare cases). If you need accuracy, then do use std::sin! (As an aside, it's also splitting the YMM regs into XMM, but that's another story.)
However, I will draw attention to the function call pre-amble (This is saving the state of the registers into the stack)
[source]
00007FF7C8CF7E00 push rsi
00007FF7C8CF7E01 push r14
00007FF7C8CF7E03 push r15
00007FF7C8CF7E05 sub rsp, 2C0h
00007FF7C8CF7E0C xor esi, esi
00007FF7C8CF7E0E vmovups ymmword ptr[rsp + 2A0h], ymm15
00007FF7C8CF7E17 vmovups ymmword ptr[rsp + 260h], ymm14
00007FF7C8CF7E20 vmovups ymmword ptr[rsp + 220h], ymm13
00007FF7C8CF7E29 vmovups ymmword ptr[rsp + 240h], ymm12
00007FF7C8CF7E32 vmovups ymmword ptr[rsp + 1C0h], ymm11
00007FF7C8CF7E3B vmovups ymmword ptr[rsp + 1E0h], ymm10
00007FF7C8CF7E44 vmovups ymmword ptr[rsp + 1A0h], ymm9
00007FF7C8CF7E4D vmovups ymmword ptr[rsp + 280h], ymm8
00007FF7C8CF7E56 vmovups ymmword ptr[rsp + 180h], ymm7
00007FF7C8CF7E5F vmovups ymmword ptr[rsp + 200h], ymm6
[/source]
and the post-amble: (Which is restoring the values from the stack back into the registers)
[source]
00007FF7C8CF7F61 vmovups ymm6, ymmword ptr[rsp + 200h]
00007FF7C8CF7F6A vmovups ymm7, ymmword ptr[rsp + 180h]
00007FF7C8CF7F73 vmovups ymm8, ymmword ptr[rsp + 280h]
00007FF7C8CF7F7C vmovups ymm9, ymmword ptr[rsp + 1A0h]
00007FF7C8CF7F85 vmovups ymm10, ymmword ptr[rsp + 1E0h]
00007FF7C8CF7F8E vmovups ymm11, ymmword ptr[rsp + 1C0h]
00007FF7C8CF7F97 vmovups ymm12, ymmword ptr[rsp + 240h]
00007FF7C8CF7FA0 vmovups ymm13, ymmword ptr[rsp + 220h]
00007FF7C8CF7FA9 vmovups ymm14, ymmword ptr[rsp + 260h]
00007FF7C8CF7FB2 vmovups ymm15, ymmword ptr[rsp + 2A0h]
00007FF7C8CF7FBB mov r13, qword ptr[rsp + 0D8h]
00007FF7C8CF7FC3 vmovaps ymm0, ymm1
00007FF7C8CF7FC7 add rsp, 2C0h
00007FF7C8CF7FCE pop r15
00007FF7C8CF7FD0 pop r14
00007FF7C8CF7FD2 pop rsi
00007FF7C8CF7FD3 ret
[/source]
Even ignoring *ALL* of the code that actually computes the sine values, I could have computed 16 'good enough for most people' sine values (even with the fmod fixing) in the same time it took to save and restore the stack for 8!!!
So getting back to your original comment. Rather than asking what are you doing that needs the performance, you should instead be asking:
What am I doing that demands the accuracy of a 500+ CPU op function call, instead of a 15 CPU op inlined call?
The answer, for the vast majority of people, is very rarely.
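
For reference, a minimal sketch of what such an inlined approximation could look like. This is an illustration only, not the code profiled above: `wrap_pi` and `fast_sin` are hypothetical names, and the polynomial is the plain degree-7 Taylor series rather than a tuned minimax fit, so treat the accuracy (roughly 2e-4 at worst, near the quarter turns) as an assumption to verify for your own use.

```cpp
#include <cmath>

// Hypothetical helper: wrap x into [-pi, pi] (the "fmod fixing").
inline float wrap_pi(float x) {
    const float pi = 3.14159265358979323846f;
    const float two_pi = 6.28318530717958647692f;
    x = std::fmod(x, two_pi);            // now in (-2pi, 2pi), same sign as the input
    if (x > pi) x -= two_pi;
    else if (x < -pi) x += two_pi;
    return x;
}

// Hypothetical approximation: fold into [-pi/2, pi/2], then evaluate
// x - x^3/6 + x^5/120 - x^7/5040 in Horner form.
inline float fast_sin(float x) {
    const float pi = 3.14159265358979323846f;
    const float half_pi = 1.57079632679489661923f;
    x = wrap_pi(x);
    if (x > half_pi) x = pi - x;         // sin(pi - x) == sin(x)
    else if (x < -half_pi) x = -pi - x;  // sin(-pi - x) == sin(x)
    const float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f / 6.0f
                + x2 * (1.0f / 120.0f
                + x2 * (-1.0f / 5040.0f))));
}
```

Because everything is visible to the compiler, a call like `fast_sin(angle)` can be inlined into the loop that uses it, which avoids the register save/restore shown above. Whether the reduced accuracy is acceptable depends entirely on the use case.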
As an aside, I used to work on a middleware animation system with some fairly demanding AAA clients. For any given character, we might be required to slerp 50 -> 100 animations at any given time. Multiply that by the number of characters in a game, and the std lib calls start to add up. If you could handle 20 characters within your allocated 2ms per-frame with the stdlib functions, you could handle 50 using approximations. In the VFX world (where I am now), the difference is even bigger.
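
To make the slerp cost concrete, here is a textbook-style sketch (not the middleware's actual code; `Quat`, `dot` and `slerp` are names made up for this example). It shows where the four trig calls per slerp come from: one `acos` and three `sin`.

```cpp
#include <cmath>

struct Quat { float x, y, z, w; };

inline float dot(const Quat& a, const Quat& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// Textbook slerp between unit quaternions a and b, with t in [0, 1].
inline Quat slerp(const Quat& a, Quat b, float t) {
    float c = dot(a, b);
    if (c < 0.0f) {                       // take the shorter arc
        c = -c;
        b = Quat{ -b.x, -b.y, -b.z, -b.w };
    }
    float wa, wb;
    if (c > 0.9995f) {
        // Nearly parallel: use linear weights to avoid dividing by ~0.
        // The result is then only approximately unit length.
        wa = 1.0f - t;
        wb = t;
    } else {
        const float theta = std::acos(c);            // trig call 1
        const float inv_s = 1.0f / std::sin(theta);  // trig call 2
        wa = std::sin((1.0f - t) * theta) * inv_s;   // trig call 3
        wb = std::sin(t * theta) * inv_s;            // trig call 4
    }
    return Quat{ wa * a.x + wb * b.x, wa * a.y + wb * b.y,
                 wa * a.z + wb * b.z, wa * a.w + wb * b.w };
}
```

At 50 to 100 slerps per character, that is several hundred library trig calls per character per frame, which is why swapping any of them for something cheaper shows up in the frame budget.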

What are you guys doing where trigonometric functions are used so heavily as to make a difference in performance?


This is awfully dismissive, but go on then, I'll bite....


I didn't mean to be dismissive...

Actually, using lots of slerps in animation could be a case where it makes sense to use lots of trigonometric functions. Although I would probably try to find some quick approximation in that case. A quick web search led me to this.


Actually, using lots of slerps in animation could be a valid reason. Although I would probably try to find some quick approximation in that case. A quick web search led me to this.

Don't worry, I've read all the literature. This was an example off the top of my head that would explain a scenario, but without getting into details that would break an NDA. Converting between Matrices/Quats/Eulers/Tan-quarter-angles springs to mind for trig funcs. Pow/log/exp and other such fun are some of the bigger problems tbh.

You know, sqrt, abs and cbrt are usually performant enough to not worry. The rest of <cmath> is usually overkill in terms of accuracy for most people; there are faster alternatives that will do the job 95% of the time (and are often 10x to 50x faster). You might be of the opinion that those improvements aren't worth the cost/benefit analysis of implementing, which is fair enough. But if you have an opportunity to replace 2 of the expensive cmath calls with 1 (via some obscure maths identity), then you should always take it. My 2 cents. (Even if the only result is that you end up with 5 seconds extra battery life when your game is running on mobile - those 5s add up!)
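
One example of that kind of identity, sketched here with made-up function names: for x in [-1, 1], sin(acos(x)) equals sqrt(1 - x*x), so two trig calls collapse into a single square root. This is the same trade the thread started with (trig calls versus a sqrt).

```cpp
#include <cmath>

// Two cmath trig calls.
inline float sin_of_acos_slow(float x) {
    return std::sin(std::acos(x));
}

// Same value for x in [-1, 1] via sin^2 + cos^2 = 1: one multiply,
// one subtract, one sqrt. The clamp guards against 1 - x*x going
// slightly negative through rounding.
inline float sin_of_acos_fast(float x) {
    const float s = 1.0f - x * x;
    return std::sqrt(s > 0.0f ? s : 0.0f);
}
```

The identity only holds as written because acos returns an angle in [0, pi], where sine is non-negative; for other angle ranges the sign has to be tracked separately.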

Oh, and if anyone copies & pastes the Quake 3 reciprocal sqrt into their source code, go outside and shout: "I am a terrible human being!", repeatedly, for the rest of time.

I can get rid of divisions and square roots in so many cases by leaning a little more heavily on trigonometric functions (that is why I am looking for a lot of power in trigonometry). Even a deeper reformulation of abstract logical problems to accommodate them seems like a strong should-do.

Once you take a reciprocal or a square root, something usually ends up hurting even after it is done. Look at linearly interpolating direction vectors only to normalize them belatedly afterwards. On the other hand, look at spherical coordinates, combined with the fact that position vectors interpolate linearly just fine.
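
For readers unfamiliar with the "lerp, then normalize" approach being criticised here, a minimal sketch (the `Vec3` type and `nlerp` name are assumptions made for this example). It costs one square root and one division per interpolation and uses no trig at all.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Linearly interpolate two direction vectors, then renormalize.
inline Vec3 nlerp(const Vec3& a, const Vec3& b, float t) {
    const Vec3 v{ a.x + t * (b.x - a.x),
                  a.y + t * (b.y - a.y),
                  a.z + t * (b.z - a.z) };
    const float len2 = v.x * v.x + v.y * v.y + v.z * v.z;
    if (len2 == 0.0f) return a;   // opposite directions at the midpoint: no unique answer
    const float inv_len = 1.0f / std::sqrt(len2);   // the sqrt and the division
    return Vec3{ v.x * inv_len, v.y * inv_len, v.z * inv_len };
}
```

Note that the interpolated direction does not move at constant angular speed as t varies, which is one of the trade-offs against interpolating angles directly.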

On the CPU side, one can even build a compact scalar field out of 32-bit integers and use it for high-degree polynomials. Then 10 scaled integer multiplications will give you a trigonometric value of extreme precision, for you to interpret as a real number. Imagine how many problems you can now throw that kind of trigonometric power at.

Division and square root functions are impossible to approximate by polynomials.

Frankly, I do not think that GPUs have any processing-unit ops for trigonometry, since for a long time they used only floating-point types for all instructions, even though trigonometry is most needed in graphics (or possibly everywhere, in my opinion).

