
Member Since 22 Jun 2002
Offline Last Active Today, 05:13 PM

Posts I've Made

In Topic: N64, 3DO, Atari Jaguar, and PS1 Game Engines

22 April 2016 - 10:44 AM

Back in the day there was the Net Yaroze, and you can still find them on eBay: http://pages.ebay.com/link/?nav=item.view&alt=web&id=351701873219&globalID=EBAY-GB

To be honest though, they're expensive and woefully underpowered, with limited information around to help you develop games. I had one, but it wasn't as exciting as playing with newer hardware (multi-texturing on an OpenGL 1.2 capable ATI Rage Fury + overclocked Celeron 300 in my case).

This stuff is interesting from a historical perspective I guess, but I've always found new shiny to be more interesting....

In Topic: What is more expensive in float?

19 April 2016 - 06:38 PM

Oh, and if anyone copies & pastes the Quake 3 reciprocal sqrt into their source code, go outside and shout: "I am a terrible human being!", repeatedly, for the rest of time. 

In Topic: What is more expensive in float?

19 April 2016 - 06:35 PM

Actually, using lots of slerps in animation could be a valid reason. Although I would probably try to find some quick approximation in that case. A quick web search led me to this.


Don't worry, I've read all the literature. This was an example off the top of my head that would explain a scenario, but without getting into details that would break an NDA. Converting between Matrices/Quats/Eulers/Tan-quarter-angles spring to mind for trig funcs. Pow/log/exp and other such fun, are some of the bigger problems tbh.


You know, sqrt, abs and cbrt are usually performant enough not to worry about. The rest of <cmath> is usually overkill in terms of accuracy for most people; there are faster alternatives that will do the job 95% of the time (and are often 10x to 50x faster). You might be of the opinion that those improvements aren't worth the cost of implementing them, which is fair enough. But if you have an opportunity to replace 2 of the expensive cmath calls with 1 (via some obscure maths identity), then you should always take it. My 2 cents. (Even if the only result is that you end up with 5 seconds extra battery life when your game is running on mobile - those 5s add up!)
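To make the "replace 2 calls with 1" point concrete, here is a minimal sketch using two textbook identities. The function names are hypothetical and chosen only for illustration; these are not the cases from the (NDA'd) code mentioned above.

```cpp
#include <cmath>

// Hypothetical helpers, named for illustration only.

// Two transcendental calls: one sin, one cos.
inline float sin_cos_product_slow(float x)
{
    return std::sin(x) * std::cos(x);
}

// One call, via the identity sin(x) * cos(x) == 0.5 * sin(2x).
inline float sin_cos_product_fast(float x)
{
    return 0.5f * std::sin(2.0f * x);
}

// Two exp calls.
inline float exp_product_slow(float a, float b)
{
    return std::exp(a) * std::exp(b);
}

// One exp call, via the identity exp(a) * exp(b) == exp(a + b).
inline float exp_product_fast(float a, float b)
{
    return std::exp(a + b);
}
```

The rewritten forms are mathematically equal but not bit-identical in floating point, and overflow behaviour can differ (exp(a) can overflow where exp(a + b) does not), so it is worth checking that the difference is acceptable for your data.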

In Topic: What is more expensive in float?

19 April 2016 - 05:50 PM

Álvaro, on 18 Apr 2016 - 10:39 PM, said :
  What are you guys doing where trigonometric functions are used so heavily as to make a difference in performance ?
This is awfully dismissive, but go on then, I'll bite.... 
Let's write a rubbish function in C++
__declspec(dllexport) void func1(float* out, const float* a, const float* b, const uint32_t n)
{
  for (uint32_t i = 0; i < n; ++i)
    out[i] = a[i] + b[i];
}
I'm exporting the func so that the compiler doesn't simply strip out the method. Any reasonable compiler will be able to happily replace that code with some tasty SIMD goodness, so let's quickly check using something such as Visual C++ 2015... 
  vmovups ymm1, YMMWORD PTR[rdx + r10 * 4]
  vaddps ymm1, ymm1, YMMWORD PTR[r8 + r10 * 4]
  lea eax, DWORD PTR[r10 + 8]
  vmovups YMMWORD PTR[r11 + r10 * 4], ymm1
  vmovups ymm1, YMMWORD PTR[rdx + rax * 4]
  vaddps ymm1, ymm1, YMMWORD PTR[r8 + rax * 4]
  add r10d, 16
  vmovups YMMWORD PTR[r11 + rax * 4], ymm1
  cmp r10d, ebx
  jb SHORT $LL4@func1
Note: I've removed some detail from the asm here. VC++ unrolls the loop into 16 floats, and adds some extra code to handle the last elements (up to 15 - which I've removed). I'm only concerned with the innermost loop here!
So that loop boils down to this :
  vaddps ymm1, ymm1, YMMWORD PTR[r8 + r10 * 4]
an AVX addition, which is using YMM registers (i.e. 8 floats at a time). Good.
Now let's make a minor change to that code:
__declspec(dllexport) void func2(float* out, const float* a, const float* b, const uint32_t n)
{
  for (uint32_t i = 0; i < n; ++i)
    out[i] = std::sin(a[i] + b[i]);
}
So, let's take a look at what that spits out.... 
  vmovups ymm1, YMMWORD PTR[r14 + rsi * 4]
  vaddps ymm0, ymm1, YMMWORD PTR[r12 + rsi * 4]
  call __vdecl_sinf8
  vmovups YMMWORD PTR[rdi + rsi * 4], ymm0
  add esi, 8
  cmp esi, r13d
  jb SHORT $LL4@func2
That's actually not *too* bad; it's inserted a nice SIMD function call here [__vdecl_sinf8] (some older compilers would actually fail to SIMD this code at all, and end up doing 1 float at a time). The performance here, though, depends on the implementation of __vdecl_sinf8, but we'll get to that later.
So let's look at the implementation I posted above:
inline float sine(const float x)
{
  const float pi = 3.1415926535897932384626433832795f;
  const float B = 4.0f / pi;
  const float C = -4.0f / (pi * pi);
  // parabolic approximation of sin(x), only valid for x in [-pi, pi]
  float y = B * x + C * x * std::abs(x);
  // refinement: weighted blend of the parabola and its signed square
  const float P = 0.225f;
  y = P * (y * std::abs(y) - y) + y;
  return y;
}
__declspec(dllexport) void func3(float* out, const float* a, const float* b, const uint32_t n)
{
  for (uint32_t i = 0; i < n; ++i)
    out[i] = sine(a[i] + b[i]);
}
A quick look at the inner loop of this approach, and we have:
  vmovups ymm1, YMMWORD PTR[rdx + r10 * 4]
  vaddps ymm1, ymm1, YMMWORD PTR[r8 + r10 * 4]
  vmovups ymm2, ymm1
  vmulps ymm1, ymm6, ymm1
  vmulps ymm4, ymm7, ymm2
  vandps ymm0, ymm2, ymm5
  vfnmadd231ps ymm1, ymm0, ymm4
  vandps ymm2, ymm1, ymm5
  vmovups ymm3, ymm1
  vfmsub231ps ymm1, ymm2, ymm1
  vfmadd231ps ymm3, ymm1, ymm8
  vmovups YMMWORD PTR[r11 + r10 * 4], ymm3
  lea eax, DWORD PTR[r10 + 8]
  add r10d, 16
  vmovups ymm1, YMMWORD PTR[rdx + rax * 4]
  vaddps ymm1, ymm1, YMMWORD PTR[r8 + rax * 4]
  vmovups ymm2, ymm1
  vmulps ymm4, ymm2, ymm7
  vandps ymm0, ymm2, ymm5
  vmulps ymm1, ymm1, ymm6
  vfnmadd231ps ymm1, ymm0, ymm4
  vmovups ymm3, ymm1
  vandps ymm2, ymm1, ymm5
  vfmsub231ps ymm1, ymm2, ymm1
  vfmadd231ps ymm3, ymm1, ymm8
  vmovups YMMWORD PTR[r11 + rax * 4], ymm3
  cmp r10d, ebx
  jb SHORT $LL4@func3
You'll notice it's unrolled the loop again, so the actual sine approximation boils down to:
vmovups ymm2, ymm1
vmulps ymm1, ymm6, ymm1
vmulps ymm4, ymm7, ymm2
vandps ymm0, ymm2, ymm5
vfnmadd231ps ymm1, ymm0, ymm4
vandps ymm2, ymm1, ymm5
vmovups ymm3, ymm1
vfmsub231ps ymm1, ymm2, ymm1
vfmadd231ps ymm3, ymm1, ymm8
vmovups YMMWORD PTR[r11 + r10 * 4], ymm3
Which isn't too bad at all (some nice tasty FMA goodness in there). One obvious problem is that it doesn't like numbers outside the -pi to +pi range. Handling that may end up adding an extra 5 or so ops to fmod the argument so it fits nicely into that range (not really a biggie with AVX).
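As a rough sketch of that range fix: the wrapper below folds the argument into [-pi, pi] before calling the approximation. It uses a floor-based wrap rather than a literal fmod, which is my assumption about one reasonable way to do it, not necessarily what you'd ship; wrap_to_pi, sine_approx and sine_wrapped are hypothetical names.

```cpp
#include <cmath>

// Same parabolic approximation as sine() above; only valid for x in [-pi, pi].
inline float sine_approx(float x)
{
    const float pi = 3.14159265358979323846f;
    const float B = 4.0f / pi;
    const float C = -4.0f / (pi * pi);
    float y = B * x + C * x * std::abs(x);
    const float P = 0.225f;
    return P * (y * std::abs(y) - y) + y;
}

// Hypothetical helper: fold any finite x into [-pi, pi] by subtracting
// the nearest whole multiple of 2*pi. Branch-free: one multiply-add,
// one floor and one more multiply-subtract.
inline float wrap_to_pi(float x)
{
    const float two_pi = 6.283185307179586f;
    const float inv_two_pi = 0.15915494309189535f;
    return x - two_pi * std::floor(x * inv_two_pi + 0.5f);
}

// Approximate sine for arguments outside [-pi, pi].
inline float sine_wrapped(float x)
{
    return sine_approx(wrap_to_pi(x));
}
```

Calling sine_wrapped(a[i] + b[i]) in the loop instead of sine(a[i] + b[i]) is the intended use. Bear in mind that for very large |x| a float no longer has enough mantissa bits to represent the position within a period, so the wrapped result degrades no matter how the wrap is done.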
Right then, back to __vdecl_sinf8. Stepping through into the method with a disassembler, I end up facing this:
00007FF7C8CF7E00  push        rsi
00007FF7C8CF7E01  push        r14
00007FF7C8CF7E03  push        r15
00007FF7C8CF7E05  sub         rsp,2C0h
00007FF7C8CF7E0C  xor         esi,esi
00007FF7C8CF7E0E  vmovups     ymmword ptr [rsp+2A0h],ymm15
00007FF7C8CF7E17  vmovups     ymmword ptr [rsp+260h],ymm14
00007FF7C8CF7E20  vmovups     ymmword ptr [rsp+220h],ymm13
00007FF7C8CF7E29  vmovups     ymmword ptr [rsp+240h],ymm12
00007FF7C8CF7E32  vmovups     ymmword ptr [rsp+1C0h],ymm11
00007FF7C8CF7E3B  vmovups     ymmword ptr [rsp+1E0h],ymm10
00007FF7C8CF7E44  vmovups     ymmword ptr [rsp+1A0h],ymm9
00007FF7C8CF7E4D  vmovups     ymmword ptr [rsp+280h],ymm8
00007FF7C8CF7E56  vmovups     ymmword ptr [rsp+180h],ymm7
00007FF7C8CF7E5F  vmovups     ymmword ptr [rsp+200h],ymm6
00007FF7C8CF7E68  vpxor       xmm7,xmm7,xmm7
00007FF7C8CF7E6C  mov         qword ptr [rsp+0D8h],r13
00007FF7C8CF7E74  lea         r13,[rsp+11Fh]
00007FF7C8CF7E7C  vmovups     ymm6,ymmword ptr [__common_ssin_data+1000h (07FF7C8D02C00h)]
00007FF7C8CF7E84  and         r13,0FFFFFFFFFFFFFFC0h
00007FF7C8CF7E88  vandps      ymm2,ymm0,ymm6
00007FF7C8CF7E8C  vcmpgt_oqps ymm1,ymm2,ymmword ptr [__common_ssin_data+1040h (07FF7C8D02C40h)]
00007FF7C8CF7E95  vextractf128 xmm3,ymm1,1
00007FF7C8CF7E9B  vpackssdw   xmm4,xmm1,xmm3
00007FF7C8CF7E9F  vpacksswb   xmm5,xmm4,xmm7
00007FF7C8CF7EA3  vpmovmskb   eax,xmm5
00007FF7C8CF7EA7  test        al,al
00007FF7C8CF7EA9  jne         __avx_sinf8+219h (07FF7C8CF8019h)
00007FF7C8CF7EAF  vmovups     ymm8,ymmword ptr [__common_ssin_data+14C0h (07FF7C8D030C0h)]
00007FF7C8CF7EB7  vmulps      ymm3,ymm2,ymmword ptr [__common_ssin_data+1480h (07FF7C8D03080h)]
00007FF7C8CF7EBF  vaddps      ymm4,ymm3,ymm8
00007FF7C8CF7EC4  vandnps     ymm1,ymm6,ymm0
00007FF7C8CF7EC8  vpslld      xmm5,xmm4,1Fh
00007FF7C8CF7ECD  vsubps      ymm13,ymm4,ymm8
00007FF7C8CF7ED2  vmulps      ymm9,ymm13,ymmword ptr [__common_ssin_data+11C0h (07FF7C8D02DC0h)]
00007FF7C8CF7EDA  vmulps      ymm10,ymm13,ymmword ptr [__common_ssin_data+1200h (07FF7C8D02E00h)]
00007FF7C8CF7EE2  vmulps      ymm12,ymm13,ymmword ptr [__common_ssin_data+1240h (07FF7C8D02E40h)]
00007FF7C8CF7EEA  vmulps      ymm15,ymm13,ymmword ptr [__common_ssin_data+1280h (07FF7C8D02E80h)]
00007FF7C8CF7EF2  vsubps      ymm2,ymm2,ymm9
00007FF7C8CF7EF7  vsubps      ymm11,ymm2,ymm10
00007FF7C8CF7EFC  vsubps      ymm14,ymm11,ymm12
00007FF7C8CF7F01  vsubps      ymm2,ymm14,ymm15
00007FF7C8CF7F06  vmulps      ymm10,ymm2,ymm2
00007FF7C8CF7F0A  vextractf128 xmm6,ymm4,1
00007FF7C8CF7F10  vmulps      ymm4,ymm10,ymmword ptr [__common_ssin_data+1440h (07FF7C8D03040h)]
00007FF7C8CF7F18  vpslld      xmm7,xmm6,1Fh
00007FF7C8CF7F1D  vinsertf128 ymm3,ymm5,xmm7,1
00007FF7C8CF7F23  vaddps      ymm5,ymm4,ymmword ptr [__common_ssin_data+1400h (07FF7C8D03000h)]
00007FF7C8CF7F2B  vmulps      ymm6,ymm5,ymm10
00007FF7C8CF7F30  vaddps      ymm7,ymm6,ymmword ptr [__common_ssin_data+13C0h (07FF7C8D02FC0h)]
00007FF7C8CF7F38  vmulps      ymm8,ymm7,ymm10
00007FF7C8CF7F3D  vaddps      ymm9,ymm8,ymmword ptr [__common_ssin_data+1380h (07FF7C8D02F80h)]
00007FF7C8CF7F45  vmulps      ymm11,ymm9,ymm10
00007FF7C8CF7F4A  vxorps      ymm13,ymm2,ymm3
00007FF7C8CF7F4E  vmulps      ymm12,ymm11,ymm13
00007FF7C8CF7F53  vaddps      ymm14,ymm12,ymm13
00007FF7C8CF7F58  vxorps      ymm1,ymm14,ymm1
00007FF7C8CF7F5C  test        sil,sil
00007FF7C8CF7F5F  jne         __avx_sinf8+1D4h (07FF7C8CF7FD4h)
00007FF7C8CF7F61  vmovups     ymm6,ymmword ptr [rsp+200h]
00007FF7C8CF7F6A  vmovups     ymm7,ymmword ptr [rsp+180h]
00007FF7C8CF7F73  vmovups     ymm8,ymmword ptr [rsp+280h]
00007FF7C8CF7F7C  vmovups     ymm9,ymmword ptr [rsp+1A0h]
00007FF7C8CF7F85  vmovups     ymm10,ymmword ptr [rsp+1E0h]
00007FF7C8CF7F8E  vmovups     ymm11,ymmword ptr [rsp+1C0h]
00007FF7C8CF7F97  vmovups     ymm12,ymmword ptr [rsp+240h]
00007FF7C8CF7FA0  vmovups     ymm13,ymmword ptr [rsp+220h]
00007FF7C8CF7FA9  vmovups     ymm14,ymmword ptr [rsp+260h]
00007FF7C8CF7FB2  vmovups     ymm15,ymmword ptr [rsp+2A0h]
00007FF7C8CF7FBB  mov         r13,qword ptr [rsp+0D8h]
00007FF7C8CF7FC3  vmovaps     ymm0,ymm1
00007FF7C8CF7FC7  add         rsp,2C0h
00007FF7C8CF7FCE  pop         r15
00007FF7C8CF7FD0  pop         r14
00007FF7C8CF7FD2  pop         rsi
00007FF7C8CF7FD3  ret
00007FF7C8CF7FD4  vmovups     ymmword ptr [r13],ymm0
00007FF7C8CF7FDA  vmovups     ymmword ptr [r13+40h],ymm1
00007FF7C8CF7FE0  test        esi,esi
00007FF7C8CF7FE2  je          __avx_sinf8+161h (07FF7C8CF7F61h)
00007FF7C8CF7FE8  xor         r14d,r14d
00007FF7C8CF7FEB  bt          esi,r14d
00007FF7C8CF7FEF  jb          __avx_sinf8+205h (07FF7C8CF8005h)
00007FF7C8CF7FF1  inc         r14d
00007FF7C8CF7FF4  cmp         r14d,20h
00007FF7C8CF7FF8  jl          __avx_sinf8+1EBh (07FF7C8CF7FEBh)
00007FF7C8CF7FFA  vmovups     ymm1,ymmword ptr [r13+40h]
00007FF7C8CF8000  jmp         __avx_sinf8+161h (07FF7C8CF7F61h)
00007FF7C8CF8005  vzeroupper
00007FF7C8CF8008  lea         rcx,[r13+r14*4]
00007FF7C8CF800D  lea         rdx,[r13+r14*4+40h]
00007FF7C8CF8012  call        __common_ssin_cout_rare (07FF7C8CF88E0h)
00007FF7C8CF8017  jmp         __avx_sinf8+1F1h (07FF7C8CF7FF1h)
00007FF7C8CF8019  vmovups     ymm10,ymmword ptr [__common_ssin_data+1080h (07FF7C8D02C80h)]
00007FF7C8CF8021  mov         edx,7F800000h
00007FF7C8CF8026  vmovups     ymmword ptr [r13],ymm0
00007FF7C8CF802C  vmovd       xmm8,edx
00007FF7C8CF8030  vpshufd     xmm13,xmm8,0
00007FF7C8CF8036  vandps      ymm6,ymm10,ymm2
00007FF7C8CF803A  mov         edx,0FFh
00007FF7C8CF803F  vcmpeqps    ymm1,ymm6,ymm10
00007FF7C8CF8045  lea         rax,[__common_ssin_reduction_data (07FF7C8D01000h)]
00007FF7C8CF804C  vpand       xmm4,xmm13,xmm0
00007FF7C8CF8050  vextractf128 xmm15,ymm0,1
00007FF7C8CF8056  vpsrld      xmm14,xmm4,17h
00007FF7C8CF805B  vpand       xmm9,xmm13,xmm15
00007FF7C8CF8060  vpslld      xmm12,xmm14,1
00007FF7C8CF8066  vpsrld      xmm11,xmm9,17h
00007FF7C8CF806C  vpaddd      xmm5,xmm12,xmm14
00007FF7C8CF8071  vpslld      xmm6,xmm11,1
00007FF7C8CF8077  vpaddd      xmm10,xmm6,xmm11
00007FF7C8CF807C  vpslld      xmm9,xmm10,2
00007FF7C8CF8082  vmovd       r14d,xmm9
00007FF7C8CF8087  vmovups     xmmword ptr [rsp+20h],xmm0
00007FF7C8CF808D  vmovups     xmmword ptr [rsp+30h],xmm15
00007FF7C8CF8093  vpextrd     r15d,xmm9,1
00007FF7C8CF8099  vpextrd     esi,xmm9,2
00007FF7C8CF809F  vextractf128 xmm3,ymm1,1
00007FF7C8CF80A5  vpackssdw   xmm2,xmm1,xmm3
00007FF7C8CF80A9  vpslld      xmm1,xmm5,2
00007FF7C8CF80AE  vpacksswb   xmm7,xmm2,xmm7
00007FF7C8CF80B2  vpmovmskb   ecx,xmm7
00007FF7C8CF80B6  vmovd       r8d,xmm1
00007FF7C8CF80BB  vmovd       xmm12,dword ptr [r14+rax]
00007FF7C8CF80C1  vmovd       xmm14,dword ptr [r15+rax]
00007FF7C8CF80C7  mov         dword ptr [rsp+0D0h],ecx
00007FF7C8CF80CE  vpextrd     ecx,xmm9,3
00007FF7C8CF80D4  vpextrd     r10d,xmm1,2
00007FF7C8CF80DA  vpextrd     r11d,xmm1,3
00007FF7C8CF80E0  vpextrd     r9d,xmm1,1
00007FF7C8CF80E6  vmovd       xmm5,dword ptr [rsi+rax]
00007FF7C8CF80EB  vmovd       xmm6,dword ptr [rcx+rax]
00007FF7C8CF80F0  vmovd       xmm7,dword ptr [r10+rax]
00007FF7C8CF80F6  vmovd       xmm8,dword ptr [r11+rax]
00007FF7C8CF80FC  vpunpcklqdq xmm11,xmm12,xmm14
00007FF7C8CF8101  vpunpcklqdq xmm10,xmm5,xmm6
00007FF7C8CF8105  vmovd       xmm5,dword ptr [rsi+rax+4]
00007FF7C8CF810B  vmovd       xmm6,dword ptr [rcx+rax+4]
00007FF7C8CF8111  vpunpcklqdq xmm13,xmm7,xmm8
00007FF7C8CF8116  vshufps     xmm8,xmm11,xmm10,88h
00007FF7C8CF811C  vmovd       xmm12,dword ptr [r14+rax+4]
00007FF7C8CF8123  vmovd       xmm14,dword ptr [r15+rax+4]
00007FF7C8CF812A  vpunpcklqdq xmm10,xmm5,xmm6
00007FF7C8CF812E  vmovd       xmm5,dword ptr [r15+rax+8]
00007FF7C8CF8135  mov         r15d,7FFFFFh
00007FF7C8CF813B  vmovd       xmm3,dword ptr [r8+rax]
00007FF7C8CF8141  vmovd       xmm2,dword ptr [r9+rax]
00007FF7C8CF8147  vpunpcklqdq xmm11,xmm12,xmm14
00007FF7C8CF814C  vmovd       xmm14,dword ptr [r14+rax+8]
00007FF7C8CF8153  mov         r14d,800000h
00007FF7C8CF8159  vpunpcklqdq xmm4,xmm3,xmm2
00007FF7C8CF815D  vmovd       xmm1,dword ptr [r8+rax+4]
00007FF7C8CF8164  vmovd       xmm3,dword ptr [r9+rax+4]
00007FF7C8CF816B  vmovd       xmm2,dword ptr [r10+rax+4]
00007FF7C8CF8172  vmovd       xmm7,dword ptr [r11+rax+4]
00007FF7C8CF8179  vshufps     xmm13,xmm4,xmm13,88h
00007FF7C8CF817F  vpunpcklqdq xmm4,xmm1,xmm3
00007FF7C8CF8183  vpunpcklqdq xmm9,xmm2,xmm7
00007FF7C8CF8187  vmovd       xmm7,dword ptr [r11+rax+8]
00007FF7C8CF818E  mov         r11d,0FFFFh
00007FF7C8CF8194  vmovd       xmm1,dword ptr [r8+rax+8]
00007FF7C8CF819B  mov         r8d,47400000h
00007FF7C8CF81A1  vmovd       xmm3,dword ptr [r9+rax+8]
00007FF7C8CF81A8  mov         r9d,3F800000h
00007FF7C8CF81AE  vmovd       xmm2,dword ptr [r10+rax+8]
00007FF7C8CF81B5  mov         r10d,80000000h
00007FF7C8CF81BB  vshufps     xmm4,xmm4,xmm9,88h
00007FF7C8CF81C1  vpunpcklqdq xmm9,xmm1,xmm3
00007FF7C8CF81C5  vpunpcklqdq xmm12,xmm2,xmm7
00007FF7C8CF81C9  vmovd       xmm7,r15d
00007FF7C8CF81CE  vshufps     xmm1,xmm9,xmm12,88h
00007FF7C8CF81D4  vmovd       xmm9,r14d
00007FF7C8CF81D9  vpshufd     xmm12,xmm7,0
00007FF7C8CF81DE  mov         r15d,28800000h
00007FF7C8CF81E4  vpunpcklqdq xmm3,xmm14,xmm5
00007FF7C8CF81E8  vpand       xmm0,xmm12,xmm0
00007FF7C8CF81EC  vpshufd     xmm5,xmm9,0
00007FF7C8CF81F2  vpand       xmm15,xmm12,xmm15
00007FF7C8CF81F7  vshufps     xmm10,xmm11,xmm10,88h
00007FF7C8CF81FD  vpaddd      xmm14,xmm0,xmm5
00007FF7C8CF8201  vmovd       xmm6,dword ptr [rsi+rax+8]
00007FF7C8CF8207  vmovd       xmm0,r11d
00007FF7C8CF820C  vmovd       xmm11,dword ptr [rcx+rax+8]
00007FF7C8CF8212  lea         rax,[__common_ssin_data (07FF7C8D01C00h)]
00007FF7C8CF8219  mov         r14d,3FFFFh
00007FF7C8CF821F  vpunpcklqdq xmm2,xmm6,xmm11
00007FF7C8CF8224  vpaddd      xmm6,xmm15,xmm5
00007FF7C8CF8228  vpshufd     xmm15,xmm0,0
00007FF7C8CF822D  vpsrld      xmm11,xmm4,10h
00007FF7C8CF8232  vpand       xmm7,xmm13,xmm15
00007FF7C8CF8237  vpand       xmm9,xmm14,xmm15
00007FF7C8CF823C  vmovups     xmmword ptr [rsp+50h],xmm8
00007FF7C8CF8242  vpand       xmm12,xmm8,xmm15
00007FF7C8CF8247  vshufps     xmm3,xmm3,xmm2,88h
00007FF7C8CF824C  vpsrld      xmm8,xmm10,10h
00007FF7C8CF8252  vpand       xmm0,xmm10,xmm15
00007FF7C8CF8257  vpsrld      xmm2,xmm1,10h
00007FF7C8CF825C  vpsrld      xmm10,xmm14,10h
00007FF7C8CF8262  vpand       xmm14,xmm6,xmm15
00007FF7C8CF8267  vmovdqu     xmmword ptr [rsp+60h],xmm7
00007FF7C8CF826D  vpand       xmm5,xmm4,xmm15
00007FF7C8CF8272  vpmulld     xmm7,xmm9,xmm7
00007FF7C8CF8277  vpand       xmm1,xmm1,xmm15
00007FF7C8CF827C  vmovdqu     xmmword ptr [rsp+0B0h],xmm7
00007FF7C8CF8285  vpsrld      xmm4,xmm3,10h
00007FF7C8CF828A  vpmulld     xmm7,xmm10,xmm2
00007FF7C8CF828F  vpand       xmm3,xmm3,xmm15
00007FF7C8CF8294  vpmulld     xmm2,xmm9,xmm2
00007FF7C8CF8299  mov         r11d,34000000h
00007FF7C8CF829F  vmovups     xmmword ptr [rsp+40h],xmm13
00007FF7C8CF82A5  vpsrld      xmm13,xmm6,10h
00007FF7C8CF82AA  vpmulld     xmm6,xmm14,xmm12
00007FF7C8CF82AF  vpsrld      xmm2,xmm2,10h
00007FF7C8CF82B4  vmovdqu     xmmword ptr [rsp+70h],xmm12
00007FF7C8CF82BA  vpaddd      xmm7,xmm7,xmm2
00007FF7C8CF82BE  vmovdqu     xmmword ptr [rsp+90h],xmm8
00007FF7C8CF82C7  mov         ecx,0B795777Ah
00007FF7C8CF82CC  vmovdqu     xmmword ptr [rsp+0A0h],xmm0
00007FF7C8CF82D5  mov         esi,7FFFFFFFh
00007FF7C8CF82DA  vmovdqu     xmmword ptr [rsp+0C0h],xmm6
00007FF7C8CF82E3  vpmulld     xmm12,xmm14,xmm8
00007FF7C8CF82E8  vpmulld     xmm6,xmm9,xmm5
00007FF7C8CF82ED  vpmulld     xmm8,xmm14,xmm0
00007FF7C8CF82F2  vpmulld     xmm0,xmm10,xmm1
00007FF7C8CF82F7  vpand       xmm2,xmm8,xmm15
00007FF7C8CF82FC  vpsrld      xmm1,xmm0,10h
00007FF7C8CF8301  vpand       xmm0,xmm6,xmm15
00007FF7C8CF8306  vpaddd      xmm0,xmm0,xmm7
00007FF7C8CF830A  vpsrld      xmm6,xmm6,10h
00007FF7C8CF830F  vpaddd      xmm7,xmm1,xmm0
00007FF7C8CF8313  vpsrld      xmm8,xmm8,10h
00007FF7C8CF8319  vpmulld     xmm0,xmm13,xmm3
00007FF7C8CF831E  vpmulld     xmm3,xmm13,xmm4
00007FF7C8CF8323  vpsrld      xmm0,xmm0,10h
00007FF7C8CF8328  vpmulld     xmm4,xmm14,xmm4
00007FF7C8CF832D  vpsrld      xmm1,xmm4,10h
00007FF7C8CF8332  vpaddd      xmm3,xmm3,xmm1
00007FF7C8CF8336  vpsrld      xmm1,xmm7,10h
00007FF7C8CF833B  vmovdqu     xmmword ptr [rsp+80h],xmm11
00007FF7C8CF8344  vpaddd      xmm2,xmm2,xmm3
00007FF7C8CF8348  vpmulld     xmm11,xmm9,xmm11
00007FF7C8CF834D  vpaddd      xmm4,xmm0,xmm2
00007FF7C8CF8351  vpmulld     xmm5,xmm10,xmm5
00007FF7C8CF8356  vpand       xmm0,xmm11,xmm15
00007FF7C8CF835B  vpaddd      xmm3,xmm5,xmm6
00007FF7C8CF835F  vpand       xmm2,xmm12,xmm15
00007FF7C8CF8364  vpaddd      xmm0,xmm0,xmm3
00007FF7C8CF8368  vpsrld      xmm3,xmm4,10h
00007FF7C8CF836D  vpaddd      xmm6,xmm1,xmm0
00007FF7C8CF8371  vpsrld      xmm11,xmm11,10h
00007FF7C8CF8377  vpmulld     xmm1,xmm13,xmmword ptr [rsp+0A0h]
00007FF7C8CF8381  vpsrld      xmm5,xmm6,10h
00007FF7C8CF8386  vpaddd      xmm0,xmm1,xmm8
00007FF7C8CF838B  vpsrld      xmm12,xmm12,10h
00007FF7C8CF8391  vpaddd      xmm1,xmm2,xmm0
00007FF7C8CF8395  vpand       xmm7,xmm7,xmm15
00007FF7C8CF839A  vpmulld     xmm2,xmm10,xmmword ptr [rsp+80h]
00007FF7C8CF83A4  vpaddd      xmm8,xmm3,xmm1
00007FF7C8CF83A8  vmovdqu     xmm3,xmmword ptr [rsp+0B0h]
00007FF7C8CF83B1  vpaddd      xmm0,xmm2,xmm11
00007FF7C8CF83B6  vpand       xmm1,xmm3,xmm15
00007FF7C8CF83BB  vpsrld      xmm2,xmm8,10h
00007FF7C8CF83C1  vpaddd      xmm1,xmm1,xmm0
00007FF7C8CF83C5  vpslld      xmm8,xmm8,10h
00007FF7C8CF83CB  vpmulld     xmm11,xmm13,xmmword ptr [rsp+90h]
00007FF7C8CF83D5  vpaddd      xmm5,xmm5,xmm1
00007FF7C8CF83D9  vmovdqu     xmm1,xmmword ptr [rsp+0C0h]
00007FF7C8CF83E2  vpaddd      xmm11,xmm11,xmm12
00007FF7C8CF83E7  vpand       xmm0,xmm1,xmm15
00007FF7C8CF83EC  vpsrld      xmm12,xmm5,10h
00007FF7C8CF83F1  vpaddd      xmm0,xmm0,xmm11
00007FF7C8CF83F6  vpsrld      xmm1,xmm1,10h
00007FF7C8CF83FB  vmovups     xmm11,xmmword ptr [rsp+40h]
00007FF7C8CF8401  vpaddd      xmm2,xmm2,xmm0
00007FF7C8CF8405  vpsrld      xmm0,xmm11,10h
00007FF7C8CF840B  vpand       xmm5,xmm5,xmm15
00007FF7C8CF8410  vpmulld     xmm10,xmm10,xmmword ptr [rsp+60h]
00007FF7C8CF8417  vmovd       xmm11,r9d
00007FF7C8CF841C  vpmulld     xmm9,xmm9,xmm0
00007FF7C8CF8421  vpsrld      xmm0,xmm3,10h
00007FF7C8CF8426  vpand       xmm9,xmm9,xmm15
00007FF7C8CF842B  vpaddd      xmm10,xmm10,xmm0
00007FF7C8CF842F  vpaddd      xmm3,xmm9,xmm10
00007FF7C8CF8434  vpsrld      xmm0,xmm2,10h
00007FF7C8CF8439  vpaddd      xmm9,xmm12,xmm3
00007FF7C8CF843D  vpand       xmm2,xmm2,xmm15
00007FF7C8CF8442  vmovups     xmm3,xmmword ptr [rsp+50h]
00007FF7C8CF8448  vpslld      xmm12,xmm9,10h
00007FF7C8CF844E  vpsrld      xmm9,xmm3,10h
00007FF7C8CF8453  vpaddd      xmm10,xmm12,xmm5
00007FF7C8CF8457  vpmulld     xmm13,xmm13,xmmword ptr [rsp+70h]
00007FF7C8CF845E  mov         r9d,40C90FDBh
00007FF7C8CF8464  vpmulld     xmm14,xmm14,xmm9
00007FF7C8CF8469  vpaddd      xmm13,xmm13,xmm1
00007FF7C8CF846D  vpand       xmm3,xmm14,xmm15
00007FF7C8CF8472  vpand       xmm15,xmm4,xmm15
00007FF7C8CF8477  vpaddd      xmm9,xmm3,xmm13
00007FF7C8CF847C  vmovd       xmm4,r10d
00007FF7C8CF8481  vpaddd      xmm0,xmm0,xmm9
00007FF7C8CF8486  vpslld      xmm14,xmm6,10h
00007FF7C8CF848B  vpshufd     xmm6,xmm4,0
00007FF7C8CF8490  vpslld      xmm12,xmm0,10h
00007FF7C8CF8495  vpand       xmm5,xmm6,xmmword ptr [rsp+20h]
00007FF7C8CF849B  vpaddd      xmm9,xmm14,xmm7
00007FF7C8CF849F  vpshufd     xmm7,xmm11,0
00007FF7C8CF84A5  vpaddd      xmm0,xmm12,xmm2
00007FF7C8CF84A9  vpand       xmm14,xmm6,xmmword ptr [rsp+30h]
00007FF7C8CF84AF  vpsrld      xmm1,xmm10,9
00007FF7C8CF84B5  vpxor       xmm3,xmm5,xmm7
00007FF7C8CF84B9  vmovd       xmm12,r8d
00007FF7C8CF84BE  vpaddd      xmm15,xmm8,xmm15
00007FF7C8CF84C3  vpsrld      xmm8,xmm0,9
00007FF7C8CF84C8  vpxor       xmm4,xmm14,xmm7
00007FF7C8CF84CC  vpor        xmm2,xmm1,xmm3
00007FF7C8CF84D0  vpshufd     xmm6,xmm12,0
00007FF7C8CF84D6  vpor        xmm13,xmm8,xmm4
00007FF7C8CF84DA  vmovd       xmm12,r15d
00007FF7C8CF84DF  mov         r10d,1FFh
00007FF7C8CF84E5  mov         r8d,40C91000h
00007FF7C8CF84EB  mov         r15d,35800000h
00007FF7C8CF84F1  vinsertf128 ymm1,ymm2,xmm13,1
00007FF7C8CF84F7  vmovd       xmm2,edx
00007FF7C8CF84FB  vinsertf128 ymm11,ymm6,xmm6,1
00007FF7C8CF8501  mov         edx,0FFFFF000h
00007FF7C8CF8506  vaddps      ymm4,ymm11,ymm1
00007FF7C8CF850A  vpshufd     xmm8,xmm2,0
00007FF7C8CF850F  vsubps      ymm3,ymm4,ymm11
00007FF7C8CF8514  vsubps      ymm13,ymm1,ymm3
00007FF7C8CF8518  vpshufd     xmm1,xmm12,0
00007FF7C8CF851E  vmovd       xmm12,r14d
00007FF7C8CF8523  vpxor       xmm2,xmm5,xmm1
00007FF7C8CF8527  vpxor       xmm3,xmm14,xmm1
00007FF7C8CF852B  vpshufd     xmm1,xmm12,0
00007FF7C8CF8531  vpand       xmm6,xmm1,xmm9
00007FF7C8CF8536  vpand       xmm1,xmm1,xmm15
00007FF7C8CF853B  vpslld      xmm11,xmm6,5
00007FF7C8CF8540  vpslld      xmm12,xmm1,5
00007FF7C8CF8545  vpor        xmm11,xmm11,xmm2
00007FF7C8CF8549  vpor        xmm6,xmm12,xmm3
00007FF7C8CF854D  vpsrld      xmm9,xmm9,12h
00007FF7C8CF8553  vpsrld      xmm15,xmm15,12h
00007FF7C8CF8559  vinsertf128 ymm3,ymm2,xmm3,1
00007FF7C8CF855F  vmovd       xmm2,r10d
00007FF7C8CF8564  vinsertf128 ymm1,ymm11,xmm6,1
00007FF7C8CF856A  vmovd       xmm11,ecx
00007FF7C8CF856E  vpshufd     xmm6,xmm2,0
00007FF7C8CF8573  vmovd       xmm2,r9d
00007FF7C8CF8578  vsubps      ymm12,ymm1,ymm3
00007FF7C8CF857C  vmovd       xmm1,r11d
00007FF7C8CF8581  vpand       xmm10,xmm6,xmm10
00007FF7C8CF8586  vpand       xmm0,xmm6,xmm0
00007FF7C8CF858A  vpshufd     xmm3,xmm1,0
00007FF7C8CF858F  vpslld      xmm10,xmm10,0Eh
00007FF7C8CF8595  vpxor       xmm5,xmm5,xmm3
00007FF7C8CF8599  vpor        xmm10,xmm10,xmm9
00007FF7C8CF859E  vpor        xmm1,xmm10,xmm5
00007FF7C8CF85A2  vpslld      xmm10,xmm0,0Eh
00007FF7C8CF85A7  vpxor       xmm14,xmm14,xmm3
00007FF7C8CF85AB  vpor        xmm10,xmm10,xmm15
00007FF7C8CF85B0  vpor        xmm0,xmm10,xmm14
00007FF7C8CF85B5  vinsertf128 ymm3,ymm1,xmm0,1
00007FF7C8CF85BB  vinsertf128 ymm14,ymm5,xmm14,1
00007FF7C8CF85C1  vmovd       xmm5,r8d
00007FF7C8CF85C6  vsubps      ymm10,ymm3,ymm14
00007FF7C8CF85CB  vpshufd     xmm6,xmm5,0
00007FF7C8CF85D0  vaddps      ymm9,ymm13,ymm10
00007FF7C8CF85D5  vsubps      ymm0,ymm13,ymm9
00007FF7C8CF85DA  vpshufd     xmm13,xmm2,0
00007FF7C8CF85DF  vaddps      ymm1,ymm10,ymm0
00007FF7C8CF85E3  vmovd       xmm10,edx
00007FF7C8CF85E7  vpshufd     xmm0,xmm10,0
00007FF7C8CF85ED  vaddps      ymm15,ymm1,ymm12
00007FF7C8CF85F2  vpshufd     xmm12,xmm11,0
00007FF7C8CF85F8  vinsertf128 ymm1,ymm0,xmm0,1
00007FF7C8CF85FE  vandps      ymm5,ymm9,ymm1
00007FF7C8CF8602  vsubps      ymm9,ymm9,ymm5
00007FF7C8CF8606  vinsertf128 ymm13,ymm13,xmm13,1
00007FF7C8CF860C  vinsertf128 ymm2,ymm6,xmm6,1
00007FF7C8CF8612  vmovd       xmm6,r15d
00007FF7C8CF8617  vinsertf128 ymm3,ymm12,xmm12,1
00007FF7C8CF861D  vmulps      ymm10,ymm2,ymm9
00007FF7C8CF8622  vmulps      ymm1,ymm3,ymm5
00007FF7C8CF8626  vmulps      ymm14,ymm13,ymm15
00007FF7C8CF862B  vmulps      ymm3,ymm3,ymm9
00007FF7C8CF8630  vmulps      ymm0,ymm2,ymm5
00007FF7C8CF8634  vmovd       xmm2,esi
00007FF7C8CF8638  vpshufd     xmm5,xmm2,0
00007FF7C8CF863D  vpshufd     xmm9,xmm6,0
00007FF7C8CF8642  vaddps      ymm15,ymm10,ymm1
00007FF7C8CF8646  vaddps      ymm10,ymm14,ymm3
00007FF7C8CF864A  vaddps      ymm3,ymm15,ymm10
00007FF7C8CF864F  vaddps      ymm1,ymm0,ymm3
00007FF7C8CF8653  vsubps      ymm0,ymm0,ymm1
00007FF7C8CF8657  vaddps      ymm10,ymm0,ymm3
00007FF7C8CF865B  vmovups     ymm0,ymmword ptr [r13]
00007FF7C8CF8661  mov         esi,dword ptr [rsp+0D0h]
00007FF7C8CF8668  vextractf128 xmm7,ymm4,1
00007FF7C8CF866E  vpand       xmm4,xmm4,xmm8
00007FF7C8CF8673  vpslld      xmm2,xmm4,4
00007FF7C8CF8678  vpand       xmm7,xmm7,xmm8
00007FF7C8CF867D  vmovd       r15d,xmm2
00007FF7C8CF8682  vpextrd     r14d,xmm2,1
00007FF7C8CF8688  vpextrd     r11d,xmm2,2
00007FF7C8CF868E  vpextrd     r10d,xmm2,3
00007FF7C8CF8694  vmovd       xmm8,dword ptr [r15+rax]
00007FF7C8CF869A  vmovd       xmm2,dword ptr [r14+rax]
00007FF7C8CF86A0  vinsertf128 ymm13,ymm9,xmm9,1
00007FF7C8CF86A6  vpslld      xmm9,xmm7,4
00007FF7C8CF86AB  vmovd       r9d,xmm9
00007FF7C8CF86B0  vmovd       xmm4,dword ptr [r10+rax]
00007FF7C8CF86B6  vpextrd     r8d,xmm9,1
00007FF7C8CF86BC  vpextrd     ecx,xmm9,2
00007FF7C8CF86C2  vpextrd     edx,xmm9,3
00007FF7C8CF86C8  vmovd       xmm9,dword ptr [r10+rax+4]
00007FF7C8CF86CF  vinsertf128 ymm11,ymm5,xmm5,1
00007FF7C8CF86D5  vandps      ymm12,ymm0,ymm11
00007FF7C8CF86DA  vcmpgt_oqps ymm3,ymm12,ymm13
00007FF7C8CF86E0  vcmple_oqps ymm14,ymm12,ymm13
00007FF7C8CF86E6  vpunpcklqdq xmm5,xmm8,xmm2
00007FF7C8CF86EA  vmovd       xmm8,dword ptr [r11+rax]
00007FF7C8CF86F0  vpunpcklqdq xmm6,xmm8,xmm4
00007FF7C8CF86F4  vmovd       xmm8,dword ptr [r11+rax+4]
00007FF7C8CF86FB  vandps      ymm15,ymm14,ymm0
00007FF7C8CF86FF  vandps      ymm1,ymm3,ymm1
00007FF7C8CF8703  vshufps     xmm7,xmm5,xmm6,88h
00007FF7C8CF8708  vmovd       xmm11,dword ptr [r9+rax]
00007FF7C8CF870E  vmovd       xmm12,dword ptr [r8+rax]
00007FF7C8CF8714  vmovd       xmm13,dword ptr [rcx+rax]
00007FF7C8CF8719  vmovd       xmm14,dword ptr [rdx+rax]
00007FF7C8CF871E  vmovd       xmm5,dword ptr [r15+rax+4]
00007FF7C8CF8725  vmovd       xmm6,dword ptr [r14+rax+4]
00007FF7C8CF872C  vorps       ymm1,ymm15,ymm1
00007FF7C8CF8730  vpunpcklqdq xmm15,xmm11,xmm12
00007FF7C8CF8735  vpunpcklqdq xmm2,xmm13,xmm14
00007FF7C8CF873A  vpunpcklqdq xmm11,xmm5,xmm6
00007FF7C8CF873E  vpunpcklqdq xmm12,xmm8,xmm9
00007FF7C8CF8743  vshufps     xmm4,xmm15,xmm2,88h
00007FF7C8CF8748  vshufps     xmm13,xmm11,xmm12,88h
00007FF7C8CF874E  vandps      ymm10,ymm3,ymm10
00007FF7C8CF8753  vmulps      ymm3,ymm1,ymm1
00007FF7C8CF8757  vinsertf128 ymm2,ymm7,xmm4,1
00007FF7C8CF875D  vmovd       xmm4,dword ptr [r9+rax+4]
00007FF7C8CF8764  lea         rax,[__common_ssin_reduction_data (07FF7C8D01000h)]
00007FF7C8CF876B  vmovd       xmm11,dword ptr [r8+rax+4]
00007FF7C8CF8772  vmovd       xmm12,dword ptr [rcx+rax+4]
00007FF7C8CF8778  vmovd       xmm14,dword ptr [rdx+rax+4]
00007FF7C8CF877E  vpunpcklqdq xmm15,xmm4,xmm11
00007FF7C8CF8783  vpunpcklqdq xmm8,xmm12,xmm14
00007FF7C8CF8788  vshufps     xmm8,xmm15,xmm8,88h
00007FF7C8CF878E  vinsertf128 ymm9,ymm13,xmm8,1
00007FF7C8CF8794  vmovd       xmm13,dword ptr [r15+rax+0Ch]
00007FF7C8CF879B  vmovd       xmm8,dword ptr [r14+rax+0Ch]
00007FF7C8CF87A2  vmovd       xmm6,dword ptr [r11+rax+0Ch]
00007FF7C8CF87A9  vmovd       xmm5,dword ptr [r10+rax+0Ch]
00007FF7C8CF87B0  vpunpcklqdq xmm4,xmm13,xmm8
00007FF7C8CF87B5  vmovd       xmm12,dword ptr [r9+rax+0Ch]
00007FF7C8CF87BC  vmovd       xmm13,dword ptr [r8+rax+0Ch]
00007FF7C8CF87C3  vmovd       xmm14,dword ptr [rcx+rax+0Ch]
00007FF7C8CF87C9  vmovd       xmm15,dword ptr [rdx+rax+0Ch]
00007FF7C8CF87CF  vpunpcklqdq xmm7,xmm6,xmm5
00007FF7C8CF87D3  vpunpcklqdq xmm8,xmm12,xmm13
00007FF7C8CF87D8  vpunpcklqdq xmm6,xmm14,xmm15
00007FF7C8CF87DD  vshufps     xmm11,xmm4,xmm7,88h
00007FF7C8CF87E2  vshufps     xmm5,xmm8,xmm6,88h
00007FF7C8CF87E7  vmulps      ymm15,ymm2,ymm1
00007FF7C8CF87EB  vinsertf128 ymm7,ymm11,xmm5,1
00007FF7C8CF87F1  vmulps      ymm12,ymm1,ymm7
00007FF7C8CF87F5  vaddps      ymm13,ymm9,ymm12
00007FF7C8CF87FA  vsubps      ymm4,ymm9,ymm13
00007FF7C8CF87FF  vaddps      ymm8,ymm15,ymm13
00007FF7C8CF8804  vaddps      ymm5,ymm4,ymm12
00007FF7C8CF8809  vsubps      ymm14,ymm13,ymm8
00007FF7C8CF880E  vmovd       xmm13,dword ptr [r10+rax+8]
00007FF7C8CF8815  vmulps      ymm4,ymm3,ymmword ptr [__common_ssin_data+1100h (07FF7C8D02D00h)]
00007FF7C8CF881D  vaddps      ymm6,ymm14,ymm15
00007FF7C8CF8822  vaddps      ymm11,ymm4,ymmword ptr [__common_ssin_data+10C0h (07FF7C8D02CC0h)]
00007FF7C8CF882A  vaddps      ymm4,ymm2,ymm7
00007FF7C8CF882E  vaddps      ymm6,ymm6,ymm5
00007FF7C8CF8832  vmovd       xmm7,dword ptr [r15+rax+8]
00007FF7C8CF8839  vmulps      ymm12,ymm11,ymm3
00007FF7C8CF883D  vmulps      ymm2,ymm3,ymmword ptr [__common_ssin_data+1180h (07FF7C8D02D80h)]
00007FF7C8CF8845  vmovd       xmm11,dword ptr [r14+rax+8]
00007FF7C8CF884C  vpunpcklqdq xmm14,xmm7,xmm11
00007FF7C8CF8851  vmovd       xmm7,dword ptr [r9+rax+8]
00007FF7C8CF8858  vmovd       xmm11,dword ptr [r8+rax+8]
00007FF7C8CF885F  vmulps      ymm5,ymm12,ymm1
00007FF7C8CF8863  vmulps      ymm1,ymm1,ymm9
00007FF7C8CF8868  vmovd       xmm12,dword ptr [r11+rax+8]
00007FF7C8CF886F  vpunpcklqdq xmm15,xmm12,xmm13
00007FF7C8CF8874  vaddps      ymm2,ymm2,ymmword ptr [__common_ssin_data+1140h (07FF7C8D02D40h)]
00007FF7C8CF887C  vpunpcklqdq xmm13,xmm7,xmm11
00007FF7C8CF8881  vmovd       xmm7,dword ptr [rcx+rax+8]
00007FF7C8CF8887  vsubps      ymm1,ymm4,ymm1
00007FF7C8CF888B  vmovd       xmm12,dword ptr [rdx+rax+8]
00007FF7C8CF8891  vmulps      ymm3,ymm2,ymm3
00007FF7C8CF8895  vmulps      ymm10,ymm10,ymm1
00007FF7C8CF8899  vshufps     xmm2,xmm14,xmm15,88h
00007FF7C8CF889F  vpunpcklqdq xmm14,xmm7,xmm12
00007FF7C8CF88A4  vshufps     xmm15,xmm13,xmm14,88h
00007FF7C8CF88AA  vmulps      ymm3,ymm3,ymm9
00007FF7C8CF88AF  vmulps      ymm5,ymm5,ymm1
00007FF7C8CF88B3  vaddps      ymm6,ymm5,ymm6
00007FF7C8CF88B7  vinsertf128 ymm2,ymm2,xmm15,1
00007FF7C8CF88BD  vaddps      ymm10,ymm10,ymm2
00007FF7C8CF88C1  vaddps      ymm4,ymm3,ymm10
00007FF7C8CF88C6  vaddps      ymm7,ymm4,ymm6
00007FF7C8CF88CA  vaddps      ymm1,ymm8,ymm7
00007FF7C8CF88CE  jmp         __avx_sinf8+15Ch (07FF7C8CF7F5Ch)
00007FF7C8CF88D3  nop         word ptr [rax+rax]
00007FF7C8CF88E0  sub         rsp,28h
00007FF7C8CF88E4  mov         r8d,dword ptr [rcx]
00007FF7C8CF88E7  movzx       eax,word ptr [rcx+2]
00007FF7C8CF88EB  mov         dword ptr [rsp+20h],r8d
00007FF7C8CF88F0  and         eax,7F80h
00007FF7C8CF88F5  shr         r8d,18h
00007FF7C8CF88F9  and         r8d,7Fh
00007FF7C8CF88FD  movss       xmm1,dword ptr [rcx]
00007FF7C8CF8901  cmp         eax,7F80h
00007FF7C8CF8906  jne         __common_ssin_cout_rare+5Ch (07FF7C8CF893Ch)
00007FF7C8CF8908  mov         byte ptr [rsp+23h],r8b
00007FF7C8CF890D  cmp         dword ptr [rsp+20h],7F800000h
00007FF7C8CF8915  jne         __common_ssin_cout_rare+4Dh (07FF7C8CF892Dh)
00007FF7C8CF8917  mov         eax,1
00007FF7C8CF891C  pxor        xmm0,xmm0
00007FF7C8CF8920  mulss       xmm1,xmm0
00007FF7C8CF8924  movss       dword ptr [rdx],xmm1
00007FF7C8CF8928  add         rsp,28h
00007FF7C8CF892C  ret
00007FF7C8CF892D  mulss       xmm1,xmm1
00007FF7C8CF8931  xor         eax,eax
00007FF7C8CF8933  movss       dword ptr [rdx],xmm1
00007FF7C8CF8937  add         rsp,28h
00007FF7C8CF893B  ret
00007FF7C8CF893C  xor         eax,eax
00007FF7C8CF893E  add         rsp,28h
00007FF7C8CF8942  ret
Now, I'm not for one minute saying this code is in any way bad. It's accurate, it's relatively well optimised, and it handles all manner of errors I'm unlikely to ever encounter (e.g. the __common_ssin_cout_rare cases). If you need accuracy, then do use std::sin! (As an aside, it's also splitting the YMM regs into XMM, but that's another story.)
However, I will draw attention to the function call preamble (this saves the state of the registers onto the stack):
00007FF7C8CF7E00  push        rsi
00007FF7C8CF7E01  push        r14
00007FF7C8CF7E03  push        r15
00007FF7C8CF7E05  sub         rsp, 2C0h
00007FF7C8CF7E0C  xor         esi, esi
00007FF7C8CF7E0E  vmovups     ymmword ptr[rsp + 2A0h], ymm15
00007FF7C8CF7E17  vmovups     ymmword ptr[rsp + 260h], ymm14
00007FF7C8CF7E20  vmovups     ymmword ptr[rsp + 220h], ymm13
00007FF7C8CF7E29  vmovups     ymmword ptr[rsp + 240h], ymm12
00007FF7C8CF7E32  vmovups     ymmword ptr[rsp + 1C0h], ymm11
00007FF7C8CF7E3B  vmovups     ymmword ptr[rsp + 1E0h], ymm10
00007FF7C8CF7E44  vmovups     ymmword ptr[rsp + 1A0h], ymm9
00007FF7C8CF7E4D  vmovups     ymmword ptr[rsp + 280h], ymm8
00007FF7C8CF7E56  vmovups     ymmword ptr[rsp + 180h], ymm7
00007FF7C8CF7E5F  vmovups     ymmword ptr[rsp + 200h], ymm6
and the postamble (which restores the values from the stack back into the registers):
00007FF7C8CF7F61  vmovups     ymm6, ymmword ptr[rsp + 200h]
00007FF7C8CF7F6A  vmovups     ymm7, ymmword ptr[rsp + 180h]
00007FF7C8CF7F73  vmovups     ymm8, ymmword ptr[rsp + 280h]
00007FF7C8CF7F7C  vmovups     ymm9, ymmword ptr[rsp + 1A0h]
00007FF7C8CF7F85  vmovups     ymm10, ymmword ptr[rsp + 1E0h]
00007FF7C8CF7F8E  vmovups     ymm11, ymmword ptr[rsp + 1C0h]
00007FF7C8CF7F97  vmovups     ymm12, ymmword ptr[rsp + 240h]
00007FF7C8CF7FA0  vmovups     ymm13, ymmword ptr[rsp + 220h]
00007FF7C8CF7FA9  vmovups     ymm14, ymmword ptr[rsp + 260h]
00007FF7C8CF7FB2  vmovups     ymm15, ymmword ptr[rsp + 2A0h]
00007FF7C8CF7FBB  mov         r13, qword ptr[rsp + 0D8h]
00007FF7C8CF7FC3  vmovaps     ymm0, ymm1
00007FF7C8CF7FC7  add         rsp, 2C0h
00007FF7C8CF7FCE  pop         r15
00007FF7C8CF7FD0  pop         r14
00007FF7C8CF7FD2  pop         rsi
00007FF7C8CF7FD3  ret
Even ignoring *ALL* of the code that actually computes the sine values, I could have computed 16 'good enough for most people' sine values (even with the fmod fixing) in the same time it took to save and restore the stack for 8!!!
So, getting back to your original comment: rather than asking what you are doing that needs the performance, you should instead be asking:
What am I doing that demands the accuracy of a 500+ CPU op function call, instead of a 15 CPU op inlined call?
The answer, for the vast majority of people, is very rarely. 
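For illustration only, here is a minimal sketch of the kind of 'good enough' inlined sine being talked about. This is NOT the approximation from my own code (that isn't shown here); it's the well-known parabola-based one, with a floor-based range reduction standing in for the fmod fixing. The name fast_sin and the constants are assumptions for the example, and the absolute error is roughly 1e-3, so don't use it where std::sin accuracy matters.

```cpp
#include <cmath>

// Hypothetical helper for illustration: a low-accuracy sine (abs. error ~1e-3).
inline float fast_sin(float x)
{
    constexpr float kPi    = 3.14159265358979f;
    constexpr float kTwoPi = 6.28318530717959f;

    // Range reduction: wrap x into [-pi, pi) (the "fmod fixing" step).
    x -= kTwoPi * std::floor(x * (1.0f / kTwoPi) + 0.5f);

    // Parabola that matches sin at -pi, -pi/2, 0, pi/2 and pi.
    constexpr float B = 4.0f / kPi;
    constexpr float C = -4.0f / (kPi * kPi);
    const float y = B * x + C * x * std::fabs(x);

    // One refinement step that blends in the squared parabola to reduce the error.
    constexpr float P = 0.225f;
    return P * (y * std::fabs(y) - y) + y;
}
```

There are no tables, no branches and no function call, so the compiler is free to inline and vectorise it inside a loop.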
As an aside, I used to work on a middleware animation system with some fairly demanding AAA clients. For any given character, we might be required to slerp 50 to 100 animations at any given time. Multiply that by the number of characters in a game, and the std lib calls start to add up. If you could handle 20 characters within your allocated 2ms per frame with the std lib functions, you could handle 50 using approximations. In the VFX world (where I am now), the difference is even bigger.
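To give a flavour of what 'using approximations' can mean for slerp, here is a sketch of normalised lerp (nlerp), a common trig-free stand-in. This is an assumption for illustration, not what that middleware actually did; the Quat struct and the nlerp name are made up for the example. It follows the same arc as slerp but not at constant angular velocity, which is often acceptable when blending closely spaced animation keys.

```cpp
#include <cmath>

// Hypothetical minimal quaternion type for the example.
struct Quat { float x, y, z, w; };

// Normalised lerp: a cheap approximation to slerp (no sin/acos calls).
inline Quat nlerp(const Quat& a, const Quat& b, float t)
{
    // Flip b if needed so we interpolate along the shorter arc.
    const float d  = a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
    const float tb = (d < 0.0f) ? -t : t;
    const float ta = 1.0f - t;

    const Quat q{ ta * a.x + tb * b.x,
                  ta * a.y + tb * b.y,
                  ta * a.z + tb * b.z,
                  ta * a.w + tb * b.w };

    // Renormalise; a single sqrt replaces the trig calls of a textbook slerp.
    const float inv = 1.0f / std::sqrt(q.x * q.x + q.y * q.y + q.z * q.z + q.w * q.w);
    return Quat{ q.x * inv, q.y * inv, q.z * inv, q.w * inv };
}
```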

In Topic: What is more expensive in float?

18 April 2016 - 03:24 PM

My guess would be that the square root and one trigonometric function might be slightly faster, but I don't know which CPU you are using, so YMMV.

<insert obligatory rant about premature optimization here, blablabla>

Thanks for the guess tho, it is a cpu standard op, who knows of what advanced math lib, whether intel or amd native ops, I believe they should not differ on this?

Sines/cosines are standard maths functions. It is possible to improve on the standard library implementations (depending on how much accuracy you are willing to sacrifice). So yeah, if you can find a way to reduce N cmath calls by one or more, then it's typically a good thing. I would be semi-inclined to make the switch from 4 funcs to 1 + sqrt without bothering to profile. It's highly unlikely that it will be slower (although standard library implementations differ between platforms, so an improvement on one platform might not be better on another).
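As an example of the sort of identity I mean (the exact expression from the original question isn't reproduced here, so treat these as assumed stand-ins): sin(acos(x)) and cos(atan(x)) each collapse to a single sqrt. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>

// sin(acos(x)) == sqrt(1 - x^2) for x in [-1, 1]: two cmath calls become one sqrt.
inline float sin_of_acos(float x)
{
    // Clamp at zero so rounding error near |x| == 1 cannot produce a NaN.
    return std::sqrt(std::max(0.0f, 1.0f - x * x));
}

// cos(atan(x)) == 1 / sqrt(1 + x^2) for any finite x.
inline float cos_of_atan(float x)
{
    return 1.0f / std::sqrt(1.0f + x * x);
}
```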

*IF* fast math is enabled in your compiler settings, then sqrt is a CPU instruction (if using strict or precise, then typically a standard library function will be used).

I think otherwise, inverse number is so expensive, while trigonometric functions can be approximated by Taylor polynomials in few degrees so well. I am going to profile, but I hope for more guesses :)

There are some fairly decent arc-cos / arc-sin approximations around. Certainly the approximations I use aren't substantially slower than the ones for the non-inverse functions.
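For reference, one widely used arc-cos approximation is the polynomial-times-sqrt form from Abramowitz & Stegun (formula 4.4.45), sketched below. It isn't necessarily the one I use; the name fast_acos is an assumption for the example, the input is assumed to already be in [-1, 1], and the absolute error is on the order of 1e-4.

```cpp
#include <cmath>

// Hypothetical helper: acos(x) ~= sqrt(1 - |x|) * cubic(|x|), mirrored for x < 0.
// Assumes x is in [-1, 1]; values outside that range produce NaN from the sqrt.
inline float fast_acos(float x)
{
    constexpr float kPi = 3.14159265358979f;
    const float ax = std::fabs(x);

    // Cubic evaluated with Horner's rule (coefficients from A&S 4.4.45).
    float p = -0.0187293f;
    p = p * ax + 0.0742610f;
    p = p * ax - 0.2121144f;
    p = p * ax + 1.5707288f;

    const float r = p * std::sqrt(1.0f - ax);

    // acos(-x) == pi - acos(x)
    return (x < 0.0f) ? kPi - r : r;
}
```

That's one sqrt, three multiply-adds and a select, which is in the same ballpark as a cheap sine approximation.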

Worth reading these:


Generally speaking, using floats will be quicker than double, but YMMV. (Long topic, so I'll leave that can of worms shut for now)