Community Reputation

2553 Excellent

1 Follower

About RobTheBloke

  1. RobTheBloke

    Manually loading OpenGL functions on Windows

    For typical Windows apps, you're correct. For minimal 64k apps though, manually loading OpenGL is not a bad idea. You can allocate a table to store the pointers at runtime, which actually cuts down the number of global variables in your exe (which would otherwise be used to store a function pointer for each GL call you are using). You can also concatenate strings to build up the function names to extract. It may seem like overkill, but it does actually cut down on memory usage. It's a very common approach in the demo scene....
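As a rough illustration of the table-plus-concatenated-names idea described above, here is a minimal sketch. All the names in it (`GenericProc`, `LoaderFn`, `GlSlot`, `kSuffixes`, `loadGlTable`) are hypothetical and chosen for this example. The loader is passed in as a parameter so the sketch is not tied to one platform; on Windows it would typically be a thin wrapper around `wglGetProcAddress`, called once a GL context is current.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Generic function-pointer type (hypothetical name; plays the role of PROC).
using GenericProc = void (*)();
// Hypothetical loader signature: returns nullptr when a name is not found.
using LoaderFn = GenericProc (*)(const char* name);

// One table slot per GL entry point the program actually uses.
enum GlSlot { kGenBuffers, kBindBuffer, kBufferData, kSlotCount };

// Suffixes in slot order, separated by '\0'. The shared "gl" prefix is
// stored only once, in the scratch buffer below.
static const char kSuffixes[] = "GenBuffers\0BindBuffer\0BufferData";

// Fills `table` (kSlotCount entries) and returns how many names resolved.
std::size_t loadGlTable(LoaderFn loader, GenericProc* table) {
  char name[64] = "gl";
  const char* suffix = kSuffixes;
  std::size_t loaded = 0;
  for (std::size_t i = 0; i < kSlotCount; ++i) {
    const std::size_t len = std::strlen(suffix);
    assert(len + 3 <= sizeof(name));         // "gl" + suffix + '\0' must fit
    std::memcpy(name + 2, suffix, len + 1);  // build "gl" + suffix
    table[i] = loader(name);
    if (table[i] != nullptr) ++loaded;
    suffix += len + 1;                       // step to the next suffix
  }
  return loaded;
}
```

Each entry would then be cast to its real signature at the call site. Whether this actually saves bytes depends on the compiler and the executable packer, so treat it as something to measure rather than a guarantee.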
  2. RobTheBloke

    Bubble Sort Algorithm Review

    Your display list function.... It could take a const ref to the vector (rather than making a copy each time).

    Again, that counter has returned! If you want to see if the vector is empty, use the empty method! (size is typically implemented as a subtraction, so it can be very modestly more expensive.) Therefore repeatedly checking whether counter is equal to size() is not considered good practice!

    The if(iter != list.end()) check in the for loop is pointless. That condition is already checked by the for loop condition, so it can never enter the 'else'.

    Repeatedly doing if/else's for commas is probably a bad idea.

    [source]
    if(theList.empty())
    {
      // blah
    }
    else
    {
      auto it = theList.begin();
      auto end = theList.end();
      cout << *it++;
      for(; it != end; ++it)
        cout << ", " << *it;
    }
    [/source]
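Putting those points together, a complete version of the display function might look like the sketch below. The names `printList` and `listToString`, the `int` element type, and the choice to print nothing for an empty list are all assumptions made for this example; the original function is not shown in the post.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Takes the vector by const reference, so no copy is made per call.
void printList(const std::vector<int>& theList, std::ostream& out) {
  if (theList.empty()) return;  // assumption: nothing is printed when empty

  auto it = theList.begin();
  const auto end = theList.end();
  out << *it++;                 // first element, no leading comma
  for (; it != end; ++it) out << ", " << *it;
}

// Convenience wrapper for callers that want a string instead of a stream.
std::string listToString(const std::vector<int>& theList) {
  std::ostringstream out;
  printList(theList, out);
  return out.str();
}
```

Writing the first element before the loop is what removes the per-element if/else for the comma.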
  3. RobTheBloke

    Bubble Sort Algorithm Review

    Your for loop seems a bit rubbish :) Why use iterators as your for loop condition, and also employ a counter? You could just use a counter. That would allow you to avoid the constant if/else within the loop (since your end condition can be size - 1).

    The alternative would be to use two iterators (which may save some additional costs associated with the array element access - not in itself expensive - but since you are passing in a reference to a vector, some compilers may force an additional pointless load op on 'begin' for each element access). Again, you can engineer your loop to test that the 'next' element is not equal to end.

    [source]
    if(theList.size() < 2)
      return;

    auto end = theList.end();
    bool altered_list = true;
    while(altered_list)
    {
      altered_list = false;
      auto it = theList.begin();
      auto next = it + 1;
      for(; next != end; ++it, ++next)
      {
        if(*it > *next)
        {
          std::swap(*it, *next);
          altered_list = true;
        }
      }
    }
    [/source]

    I suppose the correct way to swap elements would be with std::swap, which would remove the 'temp' variable and make the code a little clearer to read.

    The pass-through counter seems to be a little superfluous (unless it's just a formatting thing).

    The 'sorted' flag serves no purpose. You already have 'altered_list', which gives you the same information.
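For comparison, the counter-only variant mentioned first (end condition of size - 1, no iterators) could be sketched roughly as follows. The function name `bubbleSortIndexed` and the `int` element type are assumptions made for this example.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Index-based bubble sort: the loop stops at size - 1, so element i + 1
// always exists and no bounds check is needed inside the loop body.
void bubbleSortIndexed(std::vector<int>& theList) {
  if (theList.size() < 2) return;  // also avoids size() - 1 wrapping around

  const std::size_t last = theList.size() - 1;
  bool altered_list = true;
  while (altered_list) {
    altered_list = false;
    for (std::size_t i = 0; i < last; ++i) {
      if (theList[i] > theList[i + 1]) {
        std::swap(theList[i], theList[i + 1]);
        altered_list = true;
      }
    }
  }
}
```

The early return matters here because size() is unsigned, so size() - 1 on an empty vector would wrap to a huge value.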
  4. RobTheBloke

    N64, 3DO, Atari Jaguar, and PS1 Game Engines

    Back in the day there was the Net Yaroze, and you can still find them on eBay: http://pages.ebay.com/link/?nav=item.view&alt=web&id=351701873219&globalID=EBAY-GB To be honest though, they're expensive and woefully underpowered, with limited information around to help you develop games. I had one, but it wasn't as exciting as playing with newer hardware (multi-texturing on an OpenGL 1.2 capable ATI Rage Fury + overclocked Celeron 300 in my case). This stuff is interesting from a historical perspective I guess, but I've always found new shiny to be more interesting....
  5. RobTheBloke

    What is more expensive in float?

    Oh, and if anyone copies & pastes the Quake 3 reciprocal sqrt into their source code, go outside and shout: "I am a terrible human being!", repeatedly, for the rest of time. 
  6. RobTheBloke

    What is more expensive in float?

    Don't worry, I've read all the literature. This was an example off the top of my head that would explain a scenario, but without getting into details that would break an NDA. Converting between matrices/quats/Eulers/tan-quarter-angles springs to mind for trig funcs. Pow/log/exp and other such fun are some of the bigger problems tbh.

    You know, sqrt, abs and cbrt are usually performant enough to not worry about. The rest of <cmath> is usually overkill in terms of accuracy for most people; there are faster alternatives that will do the job 95% of the time (and are often 10x to 50x faster). You might be of the opinion that those improvements aren't worth the cost/benefit analysis of implementing, which is fair enough. But if you have an opportunity to replace 2 of the expensive cmath calls with 1 (via some obscure maths identity), then you should always take it. My 2 cents. (Even if the only result is that you end up with 5 seconds extra battery life when your game is running on mobile - those 5s add up!)
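As one small, hedged example of trading two cmath calls for one, the identity log(a) - log(b) = log(a / b) (for positive a and b) replaces two logarithms with one logarithm and a division. The function names below are made up for this sketch, and it is only an illustration of the general idea, not the specific case the post alludes to.

```cpp
#include <cassert>
#include <cmath>

// Straightforward version: two calls into <cmath>.
inline float logRatioTwoCalls(float a, float b) {
  assert(a > 0.0f && b > 0.0f);
  return std::log(a) - std::log(b);
}

// Same quantity via log(a) - log(b) == log(a / b): one call plus a divide.
// Caveat: a / b can overflow or underflow in cases where the difference of
// logs would not, and the two results may differ in the last few bits.
inline float logRatioOneCall(float a, float b) {
  assert(a > 0.0f && b > 0.0f);
  return std::log(a / b);
}
```

Whether the swap is worth it depends on how hot the call site is and on whether the changed rounding and range behaviour is acceptable, so it is worth profiling before and after.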
  7. RobTheBloke

    What is more expensive in float?

    What are you guys doing where trigonometric functions are used so heavily as to make a difference in performance?

    This is awfully dismissive, but go on then, I'll bite....

    Let's write a rubbish function in C++

    [source]
    #include <cstdint>

    __declspec(dllexport) void func1(float* out, const float* a, const float* b, const uint32_t n)
    {
      for (uint32_t i = 0; i < n; ++i)
      {
        out[i] = a[i] + b[i];
      }
    }
    [/source]

    I'm exporting the func so that the compiler doesn't simply strip out the method. Any reasonable compiler will be able to happily replace that code with some tasty SIMD goodness, so let's quickly check using something such as Visual C++ 2015...

    [source]
    $LL4@func1:
      vmovups ymm1, YMMWORD PTR[rdx + r10 * 4]
      vaddps ymm1, ymm1, YMMWORD PTR[r8 + r10 * 4]
      lea eax, DWORD PTR[r10 + 8]
      vmovups YMMWORD PTR[r11 + r10 * 4], ymm1
      vmovups ymm1, YMMWORD PTR[rdx + rax * 4]
      vaddps ymm1, ymm1, YMMWORD PTR[r8 + rax * 4]
      add r10d, 16
      vmovups YMMWORD PTR[r11 + rax * 4], ymm1
      cmp r10d, ebx
      jb SHORT $LL4@func1
    [/source]

    Note: I've removed some detail from the asm here. VC++ unrolls the loop to process 16 floats per iteration, and adds some extra code to handle the last elements (up to 15 - which I've removed). I'm only concerned with the innermost loop here!

    So that loop boils down to this:

      vaddps ymm1, ymm1, YMMWORD PTR[r8 + r10 * 4]

    an AVX addition, which is using YMM registers (i.e. 8 floats at a time). Good.

    Now let's make a minor change to that code:

    [source]
    #include <cmath>

    __declspec(dllexport) void func2(float* out, const float* a, const float* b, const uint32_t n)
    {
      for (uint32_t i = 0; i < n; ++i)
      {
        out[i] = std::sin(a[i] + b[i]);
      }
    }
    [/source]

    So, let's take a look at what that spits out....
    [source]
    $LL4@func2:
      vmovups ymm1, YMMWORD PTR[r14 + rsi * 4]
      vaddps ymm0, ymm1, YMMWORD PTR[r12 + rsi * 4]
      call __vdecl_sinf8
      vmovups YMMWORD PTR[rdi + rsi * 4], ymm0
      add esi, 8
      cmp esi, r13d
      jb SHORT $LL4@func2
    [/source]

    That's actually not *too* bad, it's inserted a nice SIMD function call here [__vdecl_sinf8] (some older compilers would actually fail to SIMD this code at all, and end up doing 1 float at a time). The performance here, though, depends on the implementation of __vdecl_sinf8, but we'll get to that later.

    So let's look at the implementation I posted above:

    [source]
    #include <cmath>

    inline float sine(const float x)
    {
      const float pi = 3.1415926535897932384626433832795f;
      const float B = 4.0f / pi;
      const float C = -4.0f / (pi * pi);

      // parabola through (-pi, 0), (0, 0), (pi, 0) with peaks of +/-1
      float y = B * x + C * x * std::abs(x);

      // blend with the squared parabola to reduce the error
      const float P = 0.225f;
      y = P * (y * std::abs(y) - y) + y;
      return y;
    }

    __declspec(dllexport) void func3(float* out, const float* a, const float* b, const uint32_t n)
    {
      for (uint32_t i = 0; i < n; ++i)
      {
        out[i] = sine(a[i] + b[i]);
      }
    }
    [/source]

    A quick look at the inner loop of this approach, and we have:

    [source]
    $LL4@func3:
      vmovups ymm1, YMMWORD PTR[rdx + r10 * 4]
      vaddps ymm1, ymm1, YMMWORD PTR[r8 + r10 * 4]
      vmovups ymm2, ymm1
      vmulps ymm1, ymm6, ymm1
      vmulps ymm4, ymm7, ymm2
      vandps ymm0, ymm2, ymm5
      vfnmadd231ps ymm1, ymm0, ymm4
      vandps ymm2, ymm1, ymm5
      vmovups ymm3, ymm1
      vfmsub231ps ymm1, ymm2, ymm1
      vfmadd231ps ymm3, ymm1, ymm8
      vmovups YMMWORD PTR[r11 + r10 * 4], ymm3
      lea eax, DWORD PTR[r10 + 8]
      add r10d, 16
      vmovups ymm1, YMMWORD PTR[rdx + rax * 4]
      vaddps ymm1, ymm1, YMMWORD PTR[r8 + rax * 4]
      vmovups ymm2, ymm1
      vmulps ymm4, ymm2, ymm7
      vandps ymm0, ymm2, ymm5
      vmulps ymm1, ymm1, ymm6
      vfnmadd231ps ymm1, ymm0, ymm4
      vmovups ymm3, ymm1
      vandps ymm2, ymm1, ymm5
      vfmsub231ps ymm1, ymm2, ymm1
      vfmadd231ps ymm3, ymm1, ymm8
      vmovups YMMWORD PTR[r11 + rax * 4], ymm3
      cmp r10d, ebx
      jb SHORT $LL4@func3
    [/source]

    You'll notice
    it's unrolled the loop again, so the actual sine approximation boils down to:

    [source]
    vmovups ymm2, ymm1
    vmulps ymm1, ymm6, ymm1
    vmulps ymm4, ymm7, ymm2
    vandps ymm0, ymm2, ymm5
    vfnmadd231ps ymm1, ymm0, ymm4
    vandps ymm2, ymm1, ymm5
    vmovups ymm3, ymm1
    vfmsub231ps ymm1, ymm2, ymm1
    vfmadd231ps ymm3, ymm1, ymm8
    vmovups YMMWORD PTR[r11 + r10 * 4], ymm3
    [/source]

    Which isn't too bad at all (some nice tasty FMA goodness in there). One obvious problem is that it doesn't like numbers outside the -Pi to +Pi range. That may end up adding an extra 5 or so ops to fmod the argument so it fits nicely into that range (not really a biggie with AVX).

    Right then, back to __vdecl_sinf8. Stepping through into the method with a disassembler, I end up facing this:

[source] 00007FF7C8CF7E00  push        rsi 00007FF7C8CF7E01  push        r14 00007FF7C8CF7E03  push        r15 00007FF7C8CF7E05  sub         rsp,2C0h 00007FF7C8CF7E0C  xor         esi,esi 00007FF7C8CF7E0E  vmovups     ymmword ptr [rsp+2A0h],ymm15 00007FF7C8CF7E17  vmovups     ymmword ptr [rsp+260h],ymm14 00007FF7C8CF7E20  vmovups     ymmword ptr [rsp+220h],ymm13 00007FF7C8CF7E29  vmovups     ymmword ptr [rsp+240h],ymm12 00007FF7C8CF7E32  vmovups     ymmword ptr [rsp+1C0h],ymm11 00007FF7C8CF7E3B  vmovups     ymmword ptr [rsp+1E0h],ymm10 00007FF7C8CF7E44  vmovups     ymmword ptr [rsp+1A0h],ymm9 00007FF7C8CF7E4D  vmovups     ymmword ptr [rsp+280h],ymm8 00007FF7C8CF7E56  vmovups     ymmword ptr [rsp+180h],ymm7 00007FF7C8CF7E5F  vmovups     ymmword ptr [rsp+200h],ymm6 00007FF7C8CF7E68  vpxor       xmm7,xmm7,xmm7 00007FF7C8CF7E6C  mov         qword ptr [rsp+0D8h],r13 00007FF7C8CF7E74  lea         r13,[rsp+11Fh] 00007FF7C8CF7E7C  vmovups     ymm6,ymmword ptr [__common_ssin_data+1000h (07FF7C8D02C00h)] 00007FF7C8CF7E84  and         r13,0FFFFFFFFFFFFFFC0h 00007FF7C8CF7E88  vandps      ymm2,ymm0,ymm6 00007FF7C8CF7E8C  vcmpgt_oqps ymm1,ymm2,ymmword ptr [__common_ssin_data+1040h (07FF7C8D02C40h)] 00007FF7C8CF7E95  vextractf128 
xmm3,ymm1,1 00007FF7C8CF7E9B  vpackssdw   xmm4,xmm1,xmm3 00007FF7C8CF7E9F  vpacksswb   xmm5,xmm4,xmm7 00007FF7C8CF7EA3  vpmovmskb   eax,xmm5 00007FF7C8CF7EA7  test        al,al 00007FF7C8CF7EA9  jne         __avx_sinf8+219h (07FF7C8CF8019h) _B1_2: 00007FF7C8CF7EAF  vmovups     ymm8,ymmword ptr [__common_ssin_data+14C0h (07FF7C8D030C0h)] 00007FF7C8CF7EB7  vmulps      ymm3,ymm2,ymmword ptr [__common_ssin_data+1480h (07FF7C8D03080h)] 00007FF7C8CF7EBF  vaddps      ymm4,ymm3,ymm8 00007FF7C8CF7EC4  vandnps     ymm1,ymm6,ymm0 00007FF7C8CF7EC8  vpslld      xmm5,xmm4,1Fh 00007FF7C8CF7ECD  vsubps      ymm13,ymm4,ymm8 00007FF7C8CF7ED2  vmulps      ymm9,ymm13,ymmword ptr [__common_ssin_data+11C0h (07FF7C8D02DC0h)] 00007FF7C8CF7EDA  vmulps      ymm10,ymm13,ymmword ptr [__common_ssin_data+1200h (07FF7C8D02E00h)] 00007FF7C8CF7EE2  vmulps      ymm12,ymm13,ymmword ptr [__common_ssin_data+1240h (07FF7C8D02E40h)] 00007FF7C8CF7EEA  vmulps      ymm15,ymm13,ymmword ptr [__common_ssin_data+1280h (07FF7C8D02E80h)] 00007FF7C8CF7EF2  vsubps      ymm2,ymm2,ymm9 00007FF7C8CF7EF7  vsubps      ymm11,ymm2,ymm10 00007FF7C8CF7EFC  vsubps      ymm14,ymm11,ymm12 00007FF7C8CF7F01  vsubps      ymm2,ymm14,ymm15 00007FF7C8CF7F06  vmulps      ymm10,ymm2,ymm2 00007FF7C8CF7F0A  vextractf128 xmm6,ymm4,1 00007FF7C8CF7F10  vmulps      ymm4,ymm10,ymmword ptr [__common_ssin_data+1440h (07FF7C8D03040h)] 00007FF7C8CF7F18  vpslld      xmm7,xmm6,1Fh 00007FF7C8CF7F1D  vinsertf128 ymm3,ymm5,xmm7,1 00007FF7C8CF7F23  vaddps      ymm5,ymm4,ymmword ptr [__common_ssin_data+1400h (07FF7C8D03000h)] 00007FF7C8CF7F2B  vmulps      ymm6,ymm5,ymm10 00007FF7C8CF7F30  vaddps      ymm7,ymm6,ymmword ptr [__common_ssin_data+13C0h (07FF7C8D02FC0h)] 00007FF7C8CF7F38  vmulps      ymm8,ymm7,ymm10 00007FF7C8CF7F3D  vaddps      ymm9,ymm8,ymmword ptr [__common_ssin_data+1380h (07FF7C8D02F80h)] 00007FF7C8CF7F45  vmulps      ymm11,ymm9,ymm10 00007FF7C8CF7F4A  vxorps      ymm13,ymm2,ymm3 00007FF7C8CF7F4E  vmulps      ymm12,ymm11,ymm13 
00007FF7C8CF7F53  vaddps      ymm14,ymm12,ymm13 00007FF7C8CF7F58  vxorps      ymm1,ymm14,ymm1 _B1_3: 00007FF7C8CF7F5C  test        sil,sil 00007FF7C8CF7F5F  jne         __avx_sinf8+1D4h (07FF7C8CF7FD4h) _B1_4: 00007FF7C8CF7F61  vmovups     ymm6,ymmword ptr [rsp+200h] 00007FF7C8CF7F6A  vmovups     ymm7,ymmword ptr [rsp+180h] 00007FF7C8CF7F73  vmovups     ymm8,ymmword ptr [rsp+280h] 00007FF7C8CF7F7C  vmovups     ymm9,ymmword ptr [rsp+1A0h] 00007FF7C8CF7F85  vmovups     ymm10,ymmword ptr [rsp+1E0h] 00007FF7C8CF7F8E  vmovups     ymm11,ymmword ptr [rsp+1C0h] 00007FF7C8CF7F97  vmovups     ymm12,ymmword ptr [rsp+240h] 00007FF7C8CF7FA0  vmovups     ymm13,ymmword ptr [rsp+220h] 00007FF7C8CF7FA9  vmovups     ymm14,ymmword ptr [rsp+260h] 00007FF7C8CF7FB2  vmovups     ymm15,ymmword ptr [rsp+2A0h] 00007FF7C8CF7FBB  mov         r13,qword ptr [rsp+0D8h] 00007FF7C8CF7FC3  vmovaps     ymm0,ymm1 00007FF7C8CF7FC7  add         rsp,2C0h 00007FF7C8CF7FCE  pop         r15 00007FF7C8CF7FD0  pop         r14 00007FF7C8CF7FD2  pop         rsi 00007FF7C8CF7FD3  ret _B1_5: 00007FF7C8CF7FD4  vmovups     ymmword ptr [r13],ymm0 00007FF7C8CF7FDA  vmovups     ymmword ptr [r13+40h],ymm1 00007FF7C8CF7FE0  test        esi,esi 00007FF7C8CF7FE2  je          __avx_sinf8+161h (07FF7C8CF7F61h) _B1_7: 00007FF7C8CF7FE8  xor         r14d,r14d _B1_8: 00007FF7C8CF7FEB  bt          esi,r14d 00007FF7C8CF7FEF  jb          __avx_sinf8+205h (07FF7C8CF8005h) _B1_9: 00007FF7C8CF7FF1  inc         r14d 00007FF7C8CF7FF4  cmp         r14d,20h 00007FF7C8CF7FF8  jl          __avx_sinf8+1EBh (07FF7C8CF7FEBh) _B1_10: 00007FF7C8CF7FFA  vmovups     ymm1,ymmword ptr [r13+40h] 00007FF7C8CF8000  jmp         __avx_sinf8+161h (07FF7C8CF7F61h) _B1_11: 00007FF7C8CF8005  vzeroupper 00007FF7C8CF8008  lea         rcx,[r13+r14*4] 00007FF7C8CF800D  lea         rdx,[r13+r14*4+40h] 00007FF7C8CF8012  call        __common_ssin_cout_rare (07FF7C8CF88E0h) 00007FF7C8CF8017  jmp         __avx_sinf8+1F1h (07FF7C8CF7FF1h) _B1_12: 00007FF7C8CF8019  
vmovups     ymm10,ymmword ptr [__common_ssin_data+1080h (07FF7C8D02C80h)] 00007FF7C8CF8021  mov         edx,7F800000h 00007FF7C8CF8026  vmovups     ymmword ptr [r13],ymm0 00007FF7C8CF802C  vmovd       xmm8,edx 00007FF7C8CF8030  vpshufd     xmm13,xmm8,0 00007FF7C8CF8036  vandps      ymm6,ymm10,ymm2 00007FF7C8CF803A  mov         edx,0FFh 00007FF7C8CF803F  vcmpeqps    ymm1,ymm6,ymm10 00007FF7C8CF8045  lea         rax,[__common_ssin_reduction_data (07FF7C8D01000h)] 00007FF7C8CF804C  vpand       xmm4,xmm13,xmm0 00007FF7C8CF8050  vextractf128 xmm15,ymm0,1 00007FF7C8CF8056  vpsrld      xmm14,xmm4,17h 00007FF7C8CF805B  vpand       xmm9,xmm13,xmm15 00007FF7C8CF8060  vpslld      xmm12,xmm14,1 00007FF7C8CF8066  vpsrld      xmm11,xmm9,17h 00007FF7C8CF806C  vpaddd      xmm5,xmm12,xmm14 00007FF7C8CF8071  vpslld      xmm6,xmm11,1 00007FF7C8CF8077  vpaddd      xmm10,xmm6,xmm11 00007FF7C8CF807C  vpslld      xmm9,xmm10,2 00007FF7C8CF8082  vmovd       r14d,xmm9 00007FF7C8CF8087  vmovups     xmmword ptr [rsp+20h],xmm0 00007FF7C8CF808D  vmovups     xmmword ptr [rsp+30h],xmm15 00007FF7C8CF8093  vpextrd     r15d,xmm9,1 00007FF7C8CF8099  vpextrd     esi,xmm9,2 00007FF7C8CF809F  vextractf128 xmm3,ymm1,1 00007FF7C8CF80A5  vpackssdw   xmm2,xmm1,xmm3 00007FF7C8CF80A9  vpslld      xmm1,xmm5,2 00007FF7C8CF80AE  vpacksswb   xmm7,xmm2,xmm7 00007FF7C8CF80B2  vpmovmskb   ecx,xmm7 00007FF7C8CF80B6  vmovd       r8d,xmm1 00007FF7C8CF80BB  vmovd       xmm12,dword ptr [r14+rax] 00007FF7C8CF80C1  vmovd       xmm14,dword ptr [r15+rax] 00007FF7C8CF80C7  mov         dword ptr [rsp+0D0h],ecx 00007FF7C8CF80CE  vpextrd     ecx,xmm9,3 00007FF7C8CF80D4  vpextrd     r10d,xmm1,2 00007FF7C8CF80DA  vpextrd     r11d,xmm1,3 00007FF7C8CF80E0  vpextrd     r9d,xmm1,1 00007FF7C8CF80E6  vmovd       xmm5,dword ptr [rsi+rax] 00007FF7C8CF80EB  vmovd       xmm6,dword ptr [rcx+rax] 00007FF7C8CF80F0  vmovd       xmm7,dword ptr [r10+rax] 00007FF7C8CF80F6  vmovd       xmm8,dword ptr [r11+rax] 00007FF7C8CF80FC  vpunpcklqdq 
xmm11,xmm12,xmm14 00007FF7C8CF8101  vpunpcklqdq xmm10,xmm5,xmm6 00007FF7C8CF8105  vmovd       xmm5,dword ptr [rsi+rax+4] 00007FF7C8CF810B  vmovd       xmm6,dword ptr [rcx+rax+4] 00007FF7C8CF8111  vpunpcklqdq xmm13,xmm7,xmm8 00007FF7C8CF8116  vshufps     xmm8,xmm11,xmm10,88h 00007FF7C8CF811C  vmovd       xmm12,dword ptr [r14+rax+4] 00007FF7C8CF8123  vmovd       xmm14,dword ptr [r15+rax+4] 00007FF7C8CF812A  vpunpcklqdq xmm10,xmm5,xmm6 00007FF7C8CF812E  vmovd       xmm5,dword ptr [r15+rax+8] 00007FF7C8CF8135  mov         r15d,7FFFFFh 00007FF7C8CF813B  vmovd       xmm3,dword ptr [r8+rax] 00007FF7C8CF8141  vmovd       xmm2,dword ptr [r9+rax] 00007FF7C8CF8147  vpunpcklqdq xmm11,xmm12,xmm14 00007FF7C8CF814C  vmovd       xmm14,dword ptr [r14+rax+8] 00007FF7C8CF8153  mov         r14d,800000h 00007FF7C8CF8159  vpunpcklqdq xmm4,xmm3,xmm2 00007FF7C8CF815D  vmovd       xmm1,dword ptr [r8+rax+4] 00007FF7C8CF8164  vmovd       xmm3,dword ptr [r9+rax+4] 00007FF7C8CF816B  vmovd       xmm2,dword ptr [r10+rax+4] 00007FF7C8CF8172  vmovd       xmm7,dword ptr [r11+rax+4] 00007FF7C8CF8179  vshufps     xmm13,xmm4,xmm13,88h 00007FF7C8CF817F  vpunpcklqdq xmm4,xmm1,xmm3 00007FF7C8CF8183  vpunpcklqdq xmm9,xmm2,xmm7 00007FF7C8CF8187  vmovd       xmm7,dword ptr [r11+rax+8] 00007FF7C8CF818E  mov         r11d,0FFFFh 00007FF7C8CF8194  vmovd       xmm1,dword ptr [r8+rax+8] 00007FF7C8CF819B  mov         r8d,47400000h 00007FF7C8CF81A1  vmovd       xmm3,dword ptr [r9+rax+8] 00007FF7C8CF81A8  mov         r9d,3F800000h 00007FF7C8CF81AE  vmovd       xmm2,dword ptr [r10+rax+8] 00007FF7C8CF81B5  mov         r10d,80000000h 00007FF7C8CF81BB  vshufps     xmm4,xmm4,xmm9,88h 00007FF7C8CF81C1  vpunpcklqdq xmm9,xmm1,xmm3 00007FF7C8CF81C5  vpunpcklqdq xmm12,xmm2,xmm7 00007FF7C8CF81C9  vmovd       xmm7,r15d 00007FF7C8CF81CE  vshufps     xmm1,xmm9,xmm12,88h 00007FF7C8CF81D4  vmovd       xmm9,r14d 00007FF7C8CF81D9  vpshufd     xmm12,xmm7,0 00007FF7C8CF81DE  mov         r15d,28800000h 00007FF7C8CF81E4  vpunpcklqdq 
xmm3,xmm14,xmm5 00007FF7C8CF81E8  vpand       xmm0,xmm12,xmm0 00007FF7C8CF81EC  vpshufd     xmm5,xmm9,0 00007FF7C8CF81F2  vpand       xmm15,xmm12,xmm15 00007FF7C8CF81F7  vshufps     xmm10,xmm11,xmm10,88h 00007FF7C8CF81FD  vpaddd      xmm14,xmm0,xmm5 00007FF7C8CF8201  vmovd       xmm6,dword ptr [rsi+rax+8] 00007FF7C8CF8207  vmovd       xmm0,r11d 00007FF7C8CF820C  vmovd       xmm11,dword ptr [rcx+rax+8] 00007FF7C8CF8212  lea         rax,[__common_ssin_data (07FF7C8D01C00h)] 00007FF7C8CF8219  mov         r14d,3FFFFh 00007FF7C8CF821F  vpunpcklqdq xmm2,xmm6,xmm11 00007FF7C8CF8224  vpaddd      xmm6,xmm15,xmm5 00007FF7C8CF8228  vpshufd     xmm15,xmm0,0 00007FF7C8CF822D  vpsrld      xmm11,xmm4,10h 00007FF7C8CF8232  vpand       xmm7,xmm13,xmm15 00007FF7C8CF8237  vpand       xmm9,xmm14,xmm15 00007FF7C8CF823C  vmovups     xmmword ptr [rsp+50h],xmm8 00007FF7C8CF8242  vpand       xmm12,xmm8,xmm15 00007FF7C8CF8247  vshufps     xmm3,xmm3,xmm2,88h 00007FF7C8CF824C  vpsrld      xmm8,xmm10,10h 00007FF7C8CF8252  vpand       xmm0,xmm10,xmm15 00007FF7C8CF8257  vpsrld      xmm2,xmm1,10h 00007FF7C8CF825C  vpsrld      xmm10,xmm14,10h 00007FF7C8CF8262  vpand       xmm14,xmm6,xmm15 00007FF7C8CF8267  vmovdqu     xmmword ptr [rsp+60h],xmm7 00007FF7C8CF826D  vpand       xmm5,xmm4,xmm15 00007FF7C8CF8272  vpmulld     xmm7,xmm9,xmm7 00007FF7C8CF8277  vpand       xmm1,xmm1,xmm15 00007FF7C8CF827C  vmovdqu     xmmword ptr [rsp+0B0h],xmm7 00007FF7C8CF8285  vpsrld      xmm4,xmm3,10h 00007FF7C8CF828A  vpmulld     xmm7,xmm10,xmm2 00007FF7C8CF828F  vpand       xmm3,xmm3,xmm15 00007FF7C8CF8294  vpmulld     xmm2,xmm9,xmm2 00007FF7C8CF8299  mov         r11d,34000000h 00007FF7C8CF829F  vmovups     xmmword ptr [rsp+40h],xmm13 00007FF7C8CF82A5  vpsrld      xmm13,xmm6,10h 00007FF7C8CF82AA  vpmulld     xmm6,xmm14,xmm12 00007FF7C8CF82AF  vpsrld      xmm2,xmm2,10h 00007FF7C8CF82B4  vmovdqu     xmmword ptr [rsp+70h],xmm12 00007FF7C8CF82BA  vpaddd      xmm7,xmm7,xmm2 00007FF7C8CF82BE  vmovdqu     xmmword ptr 
[rsp+90h],xmm8 00007FF7C8CF82C7  mov         ecx,0B795777Ah 00007FF7C8CF82CC  vmovdqu     xmmword ptr [rsp+0A0h],xmm0 00007FF7C8CF82D5  mov         esi,7FFFFFFFh 00007FF7C8CF82DA  vmovdqu     xmmword ptr [rsp+0C0h],xmm6 00007FF7C8CF82E3  vpmulld     xmm12,xmm14,xmm8 00007FF7C8CF82E8  vpmulld     xmm6,xmm9,xmm5 00007FF7C8CF82ED  vpmulld     xmm8,xmm14,xmm0 00007FF7C8CF82F2  vpmulld     xmm0,xmm10,xmm1 00007FF7C8CF82F7  vpand       xmm2,xmm8,xmm15 00007FF7C8CF82FC  vpsrld      xmm1,xmm0,10h 00007FF7C8CF8301  vpand       xmm0,xmm6,xmm15 00007FF7C8CF8306  vpaddd      xmm0,xmm0,xmm7 00007FF7C8CF830A  vpsrld      xmm6,xmm6,10h 00007FF7C8CF830F  vpaddd      xmm7,xmm1,xmm0 00007FF7C8CF8313  vpsrld      xmm8,xmm8,10h 00007FF7C8CF8319  vpmulld     xmm0,xmm13,xmm3 00007FF7C8CF831E  vpmulld     xmm3,xmm13,xmm4 00007FF7C8CF8323  vpsrld      xmm0,xmm0,10h 00007FF7C8CF8328  vpmulld     xmm4,xmm14,xmm4 00007FF7C8CF832D  vpsrld      xmm1,xmm4,10h 00007FF7C8CF8332  vpaddd      xmm3,xmm3,xmm1 00007FF7C8CF8336  vpsrld      xmm1,xmm7,10h 00007FF7C8CF833B  vmovdqu     xmmword ptr [rsp+80h],xmm11 00007FF7C8CF8344  vpaddd      xmm2,xmm2,xmm3 00007FF7C8CF8348  vpmulld     xmm11,xmm9,xmm11 00007FF7C8CF834D  vpaddd      xmm4,xmm0,xmm2 00007FF7C8CF8351  vpmulld     xmm5,xmm10,xmm5 00007FF7C8CF8356  vpand       xmm0,xmm11,xmm15 00007FF7C8CF835B  vpaddd      xmm3,xmm5,xmm6 00007FF7C8CF835F  vpand       xmm2,xmm12,xmm15 00007FF7C8CF8364  vpaddd      xmm0,xmm0,xmm3 00007FF7C8CF8368  vpsrld      xmm3,xmm4,10h 00007FF7C8CF836D  vpaddd      xmm6,xmm1,xmm0 00007FF7C8CF8371  vpsrld      xmm11,xmm11,10h 00007FF7C8CF8377  vpmulld     xmm1,xmm13,xmmword ptr [rsp+0A0h] 00007FF7C8CF8381  vpsrld      xmm5,xmm6,10h 00007FF7C8CF8386  vpaddd      xmm0,xmm1,xmm8 00007FF7C8CF838B  vpsrld      xmm12,xmm12,10h 00007FF7C8CF8391  vpaddd      xmm1,xmm2,xmm0 00007FF7C8CF8395  vpand       xmm7,xmm7,xmm15 00007FF7C8CF839A  vpmulld     xmm2,xmm10,xmmword ptr [rsp+80h] 00007FF7C8CF83A4  vpaddd      xmm8,xmm3,xmm1 
00007FF7C8CF83A8  vmovdqu     xmm3,xmmword ptr [rsp+0B0h] 00007FF7C8CF83B1  vpaddd      xmm0,xmm2,xmm11 00007FF7C8CF83B6  vpand       xmm1,xmm3,xmm15 00007FF7C8CF83BB  vpsrld      xmm2,xmm8,10h 00007FF7C8CF83C1  vpaddd      xmm1,xmm1,xmm0 00007FF7C8CF83C5  vpslld      xmm8,xmm8,10h 00007FF7C8CF83CB  vpmulld     xmm11,xmm13,xmmword ptr [rsp+90h] 00007FF7C8CF83D5  vpaddd      xmm5,xmm5,xmm1 00007FF7C8CF83D9  vmovdqu     xmm1,xmmword ptr [rsp+0C0h] 00007FF7C8CF83E2  vpaddd      xmm11,xmm11,xmm12 00007FF7C8CF83E7  vpand       xmm0,xmm1,xmm15 00007FF7C8CF83EC  vpsrld      xmm12,xmm5,10h 00007FF7C8CF83F1  vpaddd      xmm0,xmm0,xmm11 00007FF7C8CF83F6  vpsrld      xmm1,xmm1,10h 00007FF7C8CF83FB  vmovups     xmm11,xmmword ptr [rsp+40h] 00007FF7C8CF8401  vpaddd      xmm2,xmm2,xmm0 00007FF7C8CF8405  vpsrld      xmm0,xmm11,10h 00007FF7C8CF840B  vpand       xmm5,xmm5,xmm15 00007FF7C8CF8410  vpmulld     xmm10,xmm10,xmmword ptr [rsp+60h] 00007FF7C8CF8417  vmovd       xmm11,r9d 00007FF7C8CF841C  vpmulld     xmm9,xmm9,xmm0 00007FF7C8CF8421  vpsrld      xmm0,xmm3,10h 00007FF7C8CF8426  vpand       xmm9,xmm9,xmm15 00007FF7C8CF842B  vpaddd      xmm10,xmm10,xmm0 00007FF7C8CF842F  vpaddd      xmm3,xmm9,xmm10 00007FF7C8CF8434  vpsrld      xmm0,xmm2,10h 00007FF7C8CF8439  vpaddd      xmm9,xmm12,xmm3 00007FF7C8CF843D  vpand       xmm2,xmm2,xmm15 00007FF7C8CF8442  vmovups     xmm3,xmmword ptr [rsp+50h] 00007FF7C8CF8448  vpslld      xmm12,xmm9,10h 00007FF7C8CF844E  vpsrld      xmm9,xmm3,10h 00007FF7C8CF8453  vpaddd      xmm10,xmm12,xmm5 00007FF7C8CF8457  vpmulld     xmm13,xmm13,xmmword ptr [rsp+70h] 00007FF7C8CF845E  mov         r9d,40C90FDBh 00007FF7C8CF8464  vpmulld     xmm14,xmm14,xmm9 00007FF7C8CF8469  vpaddd      xmm13,xmm13,xmm1 00007FF7C8CF846D  vpand       xmm3,xmm14,xmm15 00007FF7C8CF8472  vpand       xmm15,xmm4,xmm15 00007FF7C8CF8477  vpaddd      xmm9,xmm3,xmm13 00007FF7C8CF847C  vmovd       xmm4,r10d 00007FF7C8CF8481  vpaddd      xmm0,xmm0,xmm9 00007FF7C8CF8486  vpslld      
xmm14,xmm6,10h 00007FF7C8CF848B  vpshufd     xmm6,xmm4,0 00007FF7C8CF8490  vpslld      xmm12,xmm0,10h 00007FF7C8CF8495  vpand       xmm5,xmm6,xmmword ptr [rsp+20h] 00007FF7C8CF849B  vpaddd      xmm9,xmm14,xmm7 00007FF7C8CF849F  vpshufd     xmm7,xmm11,0 00007FF7C8CF84A5  vpaddd      xmm0,xmm12,xmm2 00007FF7C8CF84A9  vpand       xmm14,xmm6,xmmword ptr [rsp+30h] 00007FF7C8CF84AF  vpsrld      xmm1,xmm10,9 00007FF7C8CF84B5  vpxor       xmm3,xmm5,xmm7 00007FF7C8CF84B9  vmovd       xmm12,r8d 00007FF7C8CF84BE  vpaddd      xmm15,xmm8,xmm15 00007FF7C8CF84C3  vpsrld      xmm8,xmm0,9 00007FF7C8CF84C8  vpxor       xmm4,xmm14,xmm7 00007FF7C8CF84CC  vpor        xmm2,xmm1,xmm3 00007FF7C8CF84D0  vpshufd     xmm6,xmm12,0 00007FF7C8CF84D6  vpor        xmm13,xmm8,xmm4 00007FF7C8CF84DA  vmovd       xmm12,r15d 00007FF7C8CF84DF  mov         r10d,1FFh 00007FF7C8CF84E5  mov         r8d,40C91000h 00007FF7C8CF84EB  mov         r15d,35800000h 00007FF7C8CF84F1  vinsertf128 ymm1,ymm2,xmm13,1 00007FF7C8CF84F7  vmovd       xmm2,edx 00007FF7C8CF84FB  vinsertf128 ymm11,ymm6,xmm6,1 00007FF7C8CF8501  mov         edx,0FFFFF000h 00007FF7C8CF8506  vaddps      ymm4,ymm11,ymm1 00007FF7C8CF850A  vpshufd     xmm8,xmm2,0 00007FF7C8CF850F  vsubps      ymm3,ymm4,ymm11 00007FF7C8CF8514  vsubps      ymm13,ymm1,ymm3 00007FF7C8CF8518  vpshufd     xmm1,xmm12,0 00007FF7C8CF851E  vmovd       xmm12,r14d 00007FF7C8CF8523  vpxor       xmm2,xmm5,xmm1 00007FF7C8CF8527  vpxor       xmm3,xmm14,xmm1 00007FF7C8CF852B  vpshufd     xmm1,xmm12,0 00007FF7C8CF8531  vpand       xmm6,xmm1,xmm9 00007FF7C8CF8536  vpand       xmm1,xmm1,xmm15 00007FF7C8CF853B  vpslld      xmm11,xmm6,5 00007FF7C8CF8540  vpslld      xmm12,xmm1,5 00007FF7C8CF8545  vpor        xmm11,xmm11,xmm2 00007FF7C8CF8549  vpor        xmm6,xmm12,xmm3 00007FF7C8CF854D  vpsrld      xmm9,xmm9,12h 00007FF7C8CF8553  vpsrld      xmm15,xmm15,12h 00007FF7C8CF8559  vinsertf128 ymm3,ymm2,xmm3,1 00007FF7C8CF855F  vmovd       xmm2,r10d 00007FF7C8CF8564  vinsertf128 
ymm1,ymm11,xmm6,1 00007FF7C8CF856A  vmovd       xmm11,ecx 00007FF7C8CF856E  vpshufd     xmm6,xmm2,0 00007FF7C8CF8573  vmovd       xmm2,r9d 00007FF7C8CF8578  vsubps      ymm12,ymm1,ymm3 00007FF7C8CF857C  vmovd       xmm1,r11d 00007FF7C8CF8581  vpand       xmm10,xmm6,xmm10 00007FF7C8CF8586  vpand       xmm0,xmm6,xmm0 00007FF7C8CF858A  vpshufd     xmm3,xmm1,0 00007FF7C8CF858F  vpslld      xmm10,xmm10,0Eh 00007FF7C8CF8595  vpxor       xmm5,xmm5,xmm3 00007FF7C8CF8599  vpor        xmm10,xmm10,xmm9 00007FF7C8CF859E  vpor        xmm1,xmm10,xmm5 00007FF7C8CF85A2  vpslld      xmm10,xmm0,0Eh 00007FF7C8CF85A7  vpxor       xmm14,xmm14,xmm3 00007FF7C8CF85AB  vpor        xmm10,xmm10,xmm15 00007FF7C8CF85B0  vpor        xmm0,xmm10,xmm14 00007FF7C8CF85B5  vinsertf128 ymm3,ymm1,xmm0,1 00007FF7C8CF85BB  vinsertf128 ymm14,ymm5,xmm14,1 00007FF7C8CF85C1  vmovd       xmm5,r8d 00007FF7C8CF85C6  vsubps      ymm10,ymm3,ymm14 00007FF7C8CF85CB  vpshufd     xmm6,xmm5,0 00007FF7C8CF85D0  vaddps      ymm9,ymm13,ymm10 00007FF7C8CF85D5  vsubps      ymm0,ymm13,ymm9 00007FF7C8CF85DA  vpshufd     xmm13,xmm2,0 00007FF7C8CF85DF  vaddps      ymm1,ymm10,ymm0 00007FF7C8CF85E3  vmovd       xmm10,edx 00007FF7C8CF85E7  vpshufd     xmm0,xmm10,0 00007FF7C8CF85ED  vaddps      ymm15,ymm1,ymm12 00007FF7C8CF85F2  vpshufd     xmm12,xmm11,0 00007FF7C8CF85F8  vinsertf128 ymm1,ymm0,xmm0,1 00007FF7C8CF85FE  vandps      ymm5,ymm9,ymm1 00007FF7C8CF8602  vsubps      ymm9,ymm9,ymm5 00007FF7C8CF8606  vinsertf128 ymm13,ymm13,xmm13,1 00007FF7C8CF860C  vinsertf128 ymm2,ymm6,xmm6,1 00007FF7C8CF8612  vmovd       xmm6,r15d 00007FF7C8CF8617  vinsertf128 ymm3,ymm12,xmm12,1 00007FF7C8CF861D  vmulps      ymm10,ymm2,ymm9 00007FF7C8CF8622  vmulps      ymm1,ymm3,ymm5 00007FF7C8CF8626  vmulps      ymm14,ymm13,ymm15 00007FF7C8CF862B  vmulps      ymm3,ymm3,ymm9 00007FF7C8CF8630  vmulps      ymm0,ymm2,ymm5 00007FF7C8CF8634  vmovd       xmm2,esi 00007FF7C8CF8638  vpshufd     xmm5,xmm2,0 00007FF7C8CF863D  vpshufd     xmm9,xmm6,0 
00007FF7C8CF8642  vaddps      ymm15,ymm10,ymm1 00007FF7C8CF8646  vaddps      ymm10,ymm14,ymm3 00007FF7C8CF864A  vaddps      ymm3,ymm15,ymm10 00007FF7C8CF864F  vaddps      ymm1,ymm0,ymm3 00007FF7C8CF8653  vsubps      ymm0,ymm0,ymm1 00007FF7C8CF8657  vaddps      ymm10,ymm0,ymm3 00007FF7C8CF865B  vmovups     ymm0,ymmword ptr [r13] 00007FF7C8CF8661  mov         esi,dword ptr [rsp+0D0h] 00007FF7C8CF8668  vextractf128 xmm7,ymm4,1 00007FF7C8CF866E  vpand       xmm4,xmm4,xmm8 00007FF7C8CF8673  vpslld      xmm2,xmm4,4 00007FF7C8CF8678  vpand       xmm7,xmm7,xmm8 00007FF7C8CF867D  vmovd       r15d,xmm2 00007FF7C8CF8682  vpextrd     r14d,xmm2,1 00007FF7C8CF8688  vpextrd     r11d,xmm2,2 00007FF7C8CF868E  vpextrd     r10d,xmm2,3 00007FF7C8CF8694  vmovd       xmm8,dword ptr [r15+rax] 00007FF7C8CF869A  vmovd       xmm2,dword ptr [r14+rax] 00007FF7C8CF86A0  vinsertf128 ymm13,ymm9,xmm9,1 00007FF7C8CF86A6  vpslld      xmm9,xmm7,4 00007FF7C8CF86AB  vmovd       r9d,xmm9 00007FF7C8CF86B0  vmovd       xmm4,dword ptr [r10+rax] 00007FF7C8CF86B6  vpextrd     r8d,xmm9,1 00007FF7C8CF86BC  vpextrd     ecx,xmm9,2 00007FF7C8CF86C2  vpextrd     edx,xmm9,3 00007FF7C8CF86C8  vmovd       xmm9,dword ptr [r10+rax+4] 00007FF7C8CF86CF  vinsertf128 ymm11,ymm5,xmm5,1 00007FF7C8CF86D5  vandps      ymm12,ymm0,ymm11 00007FF7C8CF86DA  vcmpgt_oqps ymm3,ymm12,ymm13 00007FF7C8CF86E0  vcmple_oqps ymm14,ymm12,ymm13 00007FF7C8CF86E6  vpunpcklqdq xmm5,xmm8,xmm2 00007FF7C8CF86EA  vmovd       xmm8,dword ptr [r11+rax] 00007FF7C8CF86F0  vpunpcklqdq xmm6,xmm8,xmm4 00007FF7C8CF86F4  vmovd       xmm8,dword ptr [r11+rax+4] 00007FF7C8CF86FB  vandps      ymm15,ymm14,ymm0 00007FF7C8CF86FF  vandps      ymm1,ymm3,ymm1 00007FF7C8CF8703  vshufps     xmm7,xmm5,xmm6,88h 00007FF7C8CF8708  vmovd       xmm11,dword ptr [r9+rax] 00007FF7C8CF870E  vmovd       xmm12,dword ptr [r8+rax] 00007FF7C8CF8714  vmovd       xmm13,dword ptr [rcx+rax] 00007FF7C8CF8719  vmovd       xmm14,dword ptr [rdx+rax] 00007FF7C8CF871E  vmovd       xmm5,dword ptr 
[r15+rax+4] 00007FF7C8CF8725  vmovd       xmm6,dword ptr [r14+rax+4] 00007FF7C8CF872C  vorps       ymm1,ymm15,ymm1 00007FF7C8CF8730  vpunpcklqdq xmm15,xmm11,xmm12 00007FF7C8CF8735  vpunpcklqdq xmm2,xmm13,xmm14 00007FF7C8CF873A  vpunpcklqdq xmm11,xmm5,xmm6 00007FF7C8CF873E  vpunpcklqdq xmm12,xmm8,xmm9 00007FF7C8CF8743  vshufps     xmm4,xmm15,xmm2,88h 00007FF7C8CF8748  vshufps     xmm13,xmm11,xmm12,88h 00007FF7C8CF874E  vandps      ymm10,ymm3,ymm10 00007FF7C8CF8753  vmulps      ymm3,ymm1,ymm1 00007FF7C8CF8757  vinsertf128 ymm2,ymm7,xmm4,1 00007FF7C8CF875D  vmovd       xmm4,dword ptr [r9+rax+4] _B1_15: 00007FF7C8CF8764  lea         rax,[__common_ssin_reduction_data (07FF7C8D01000h)] 00007FF7C8CF876B  vmovd       xmm11,dword ptr [r8+rax+4] 00007FF7C8CF8772  vmovd       xmm12,dword ptr [rcx+rax+4] 00007FF7C8CF8778  vmovd       xmm14,dword ptr [rdx+rax+4] 00007FF7C8CF877E  vpunpcklqdq xmm15,xmm4,xmm11 00007FF7C8CF8783  vpunpcklqdq xmm8,xmm12,xmm14 00007FF7C8CF8788  vshufps     xmm8,xmm15,xmm8,88h 00007FF7C8CF878E  vinsertf128 ymm9,ymm13,xmm8,1 00007FF7C8CF8794  vmovd       xmm13,dword ptr [r15+rax+0Ch] 00007FF7C8CF879B  vmovd       xmm8,dword ptr [r14+rax+0Ch] 00007FF7C8CF87A2  vmovd       xmm6,dword ptr [r11+rax+0Ch] 00007FF7C8CF87A9  vmovd       xmm5,dword ptr [r10+rax+0Ch] 00007FF7C8CF87B0  vpunpcklqdq xmm4,xmm13,xmm8 00007FF7C8CF87B5  vmovd       xmm12,dword ptr [r9+rax+0Ch] 00007FF7C8CF87BC  vmovd       xmm13,dword ptr [r8+rax+0Ch] 00007FF7C8CF87C3  vmovd       xmm14,dword ptr [rcx+rax+0Ch] 00007FF7C8CF87C9  vmovd       xmm15,dword ptr [rdx+rax+0Ch] 00007FF7C8CF87CF  vpunpcklqdq xmm7,xmm6,xmm5 00007FF7C8CF87D3  vpunpcklqdq xmm8,xmm12,xmm13 00007FF7C8CF87D8  vpunpcklqdq xmm6,xmm14,xmm15 00007FF7C8CF87DD  vshufps     xmm11,xmm4,xmm7,88h 00007FF7C8CF87E2  vshufps     xmm5,xmm8,xmm6,88h 00007FF7C8CF87E7  vmulps      ymm15,ymm2,ymm1 00007FF7C8CF87EB  vinsertf128 ymm7,ymm11,xmm5,1 00007FF7C8CF87F1  vmulps      ymm12,ymm1,ymm7 00007FF7C8CF87F5  vaddps      ymm13,ymm9,ymm12 
00007FF7C8CF87FA  vsubps      ymm4,ymm9,ymm13 00007FF7C8CF87FF  vaddps      ymm8,ymm15,ymm13 00007FF7C8CF8804  vaddps      ymm5,ymm4,ymm12 00007FF7C8CF8809  vsubps      ymm14,ymm13,ymm8 00007FF7C8CF880E  vmovd       xmm13,dword ptr [r10+rax+8] 00007FF7C8CF8815  vmulps      ymm4,ymm3,ymmword ptr [__common_ssin_data+1100h (07FF7C8D02D00h)] 00007FF7C8CF881D  vaddps      ymm6,ymm14,ymm15 00007FF7C8CF8822  vaddps      ymm11,ymm4,ymmword ptr [__common_ssin_data+10C0h (07FF7C8D02CC0h)] 00007FF7C8CF882A  vaddps      ymm4,ymm2,ymm7 00007FF7C8CF882E  vaddps      ymm6,ymm6,ymm5 00007FF7C8CF8832  vmovd       xmm7,dword ptr [r15+rax+8] 00007FF7C8CF8839  vmulps      ymm12,ymm11,ymm3 00007FF7C8CF883D  vmulps      ymm2,ymm3,ymmword ptr [__common_ssin_data+1180h (07FF7C8D02D80h)] 00007FF7C8CF8845  vmovd       xmm11,dword ptr [r14+rax+8] 00007FF7C8CF884C  vpunpcklqdq xmm14,xmm7,xmm11 00007FF7C8CF8851  vmovd       xmm7,dword ptr [r9+rax+8] 00007FF7C8CF8858  vmovd       xmm11,dword ptr [r8+rax+8] 00007FF7C8CF885F  vmulps      ymm5,ymm12,ymm1 00007FF7C8CF8863  vmulps      ymm1,ymm1,ymm9 00007FF7C8CF8868  vmovd       xmm12,dword ptr [r11+rax+8] 00007FF7C8CF886F  vpunpcklqdq xmm15,xmm12,xmm13 00007FF7C8CF8874  vaddps      ymm2,ymm2,ymmword ptr [__common_ssin_data+1140h (07FF7C8D02D40h)] 00007FF7C8CF887C  vpunpcklqdq xmm13,xmm7,xmm11 00007FF7C8CF8881  vmovd       xmm7,dword ptr [rcx+rax+8] 00007FF7C8CF8887  vsubps      ymm1,ymm4,ymm1 00007FF7C8CF888B  vmovd       xmm12,dword ptr [rdx+rax+8] 00007FF7C8CF8891  vmulps      ymm3,ymm2,ymm3 00007FF7C8CF8895  vmulps      ymm10,ymm10,ymm1 00007FF7C8CF8899  vshufps     xmm2,xmm14,xmm15,88h 00007FF7C8CF889F  vpunpcklqdq xmm14,xmm7,xmm12 00007FF7C8CF88A4  vshufps     xmm15,xmm13,xmm14,88h 00007FF7C8CF88AA  vmulps      ymm3,ymm3,ymm9 00007FF7C8CF88AF  vmulps      ymm5,ymm5,ymm1 00007FF7C8CF88B3  vaddps      ymm6,ymm5,ymm6 00007FF7C8CF88B7  vinsertf128 ymm2,ymm2,xmm15,1 00007FF7C8CF88BD  vaddps      ymm10,ymm10,ymm2 00007FF7C8CF88C1  vaddps      
ymm4,ymm3,ymm10 00007FF7C8CF88C6  vaddps      ymm7,ymm4,ymm6 00007FF7C8CF88CA  vaddps      ymm1,ymm8,ymm7 00007FF7C8CF88CE  jmp         __avx_sinf8+15Ch (07FF7C8CF7F5Ch) 00007FF7C8CF88D3  nop         word ptr [rax+rax] _B1_1: 00007FF7C8CF88E0  sub         rsp,28h 00007FF7C8CF88E4  mov         r8d,dword ptr [rcx] 00007FF7C8CF88E7  movzx       eax,word ptr [rcx+2] 00007FF7C8CF88EB  mov         dword ptr [rsp+20h],r8d 00007FF7C8CF88F0  and         eax,7F80h 00007FF7C8CF88F5  shr         r8d,18h 00007FF7C8CF88F9  and         r8d,7Fh 00007FF7C8CF88FD  movss       xmm1,dword ptr [rcx] 00007FF7C8CF8901  cmp         eax,7F80h 00007FF7C8CF8906  jne         __common_ssin_cout_rare+5Ch (07FF7C8CF893Ch) _B1_2: 00007FF7C8CF8908  mov         byte ptr [rsp+23h],r8b 00007FF7C8CF890D  cmp         dword ptr [rsp+20h],7F800000h 00007FF7C8CF8915  jne         __common_ssin_cout_rare+4Dh (07FF7C8CF892Dh) _B1_3: 00007FF7C8CF8917  mov         eax,1 00007FF7C8CF891C  pxor        xmm0,xmm0 00007FF7C8CF8920  mulss       xmm1,xmm0 00007FF7C8CF8924  movss       dword ptr [rdx],xmm1 00007FF7C8CF8928  add         rsp,28h 00007FF7C8CF892C  ret _B1_4: 00007FF7C8CF892D  mulss       xmm1,xmm1 00007FF7C8CF8931  xor         eax,eax 00007FF7C8CF8933  movss       dword ptr [rdx],xmm1 _B1_5: 00007FF7C8CF8937  add         rsp,28h 00007FF7C8CF893B  ret _B1_6: 00007FF7C8CF893C  xor         eax,eax 00007FF7C8CF893E  add         rsp,28h 00007FF7C8CF8942  ret [/source]

Now, I'm not for one minute saying this code is in any way bad. It's accurate, it's relatively well optimised, and it handles all manner of errors I'm unlikely to ever encounter (e.g. the __common_ssin_cout_rare cases). If you need accuracy, then do use std::sin!
(As an aside, it's also splitting the YMM regs into XMM, but that's another story.)

However, I will draw attention to the function call preamble (this saves the state of the registers onto the stack):

[source]
00007FF7C8CF7E00  push        rsi
00007FF7C8CF7E01  push        r14
00007FF7C8CF7E03  push        r15
00007FF7C8CF7E05  sub         rsp, 2C0h
00007FF7C8CF7E0C  xor         esi, esi
00007FF7C8CF7E0E  vmovups     ymmword ptr[rsp + 2A0h], ymm15
00007FF7C8CF7E17  vmovups     ymmword ptr[rsp + 260h], ymm14
00007FF7C8CF7E20  vmovups     ymmword ptr[rsp + 220h], ymm13
00007FF7C8CF7E29  vmovups     ymmword ptr[rsp + 240h], ymm12
00007FF7C8CF7E32  vmovups     ymmword ptr[rsp + 1C0h], ymm11
00007FF7C8CF7E3B  vmovups     ymmword ptr[rsp + 1E0h], ymm10
00007FF7C8CF7E44  vmovups     ymmword ptr[rsp + 1A0h], ymm9
00007FF7C8CF7E4D  vmovups     ymmword ptr[rsp + 280h], ymm8
00007FF7C8CF7E56  vmovups     ymmword ptr[rsp + 180h], ymm7
00007FF7C8CF7E5F  vmovups     ymmword ptr[rsp + 200h], ymm6
[/source]

and the postamble (which restores the values from the stack back into the registers):

[source]
00007FF7C8CF7F61  vmovups     ymm6, ymmword ptr[rsp + 200h]
00007FF7C8CF7F6A  vmovups     ymm7, ymmword ptr[rsp + 180h]
00007FF7C8CF7F73  vmovups     ymm8, ymmword ptr[rsp + 280h]
00007FF7C8CF7F7C  vmovups     ymm9, ymmword ptr[rsp + 1A0h]
00007FF7C8CF7F85  vmovups     ymm10, ymmword ptr[rsp + 1E0h]
00007FF7C8CF7F8E  vmovups     ymm11, ymmword ptr[rsp + 1C0h]
00007FF7C8CF7F97  vmovups     ymm12, ymmword ptr[rsp + 240h]
00007FF7C8CF7FA0  vmovups     ymm13, ymmword ptr[rsp + 220h]
00007FF7C8CF7FA9  vmovups     ymm14, ymmword ptr[rsp + 260h]
00007FF7C8CF7FB2  vmovups     ymm15, ymmword ptr[rsp + 2A0h]
00007FF7C8CF7FBB  mov         r13, qword ptr[rsp + 0D8h]
00007FF7C8CF7FC3  vmovaps     ymm0, ymm1
00007FF7C8CF7FC7  add         rsp, 2C0h
00007FF7C8CF7FCE  pop         r15
00007FF7C8CF7FD0  pop         r14
00007FF7C8CF7FD2  pop         rsi
00007FF7C8CF7FD3  ret
[/source]

Even ignoring *ALL* of the code that actually computes the sine values, I could have computed 16 'good enough for most people' sine values (even with the fmod fixing) in the same time it took to save and restore the registers for 8!

So, getting back to your original comment: rather than asking what you are doing that needs the performance, you should instead be asking:

What am I doing that demands the accuracy of a 500+ CPU op function call, instead of a 15 CPU op inlined call?

The answer, for the vast majority of people, is 'very rarely anything'.

As an aside, I used to work on a middleware animation system with some fairly demanding AAA clients. For any given character, we might be required to slerp 50 to 100 animations at any given time. Multiply that by the number of characters in a game, and the std lib calls start to add up. If you could handle 20 characters within your allocated 2ms per frame with the stdlib functions, you could handle 50 using approximations. In the VFX world (where I am now), the difference is even bigger.
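For reference, here's roughly what a small inlined 'good enough' sine can look like. This is only a sketch of the widely circulated parabola-based approximation, and not necessarily the code discussed in this post. The names wrap_pi and fast_sin are made up for illustration, and 0.225 is the commonly quoted blend constant, which gives a maximum error of roughly 0.001.

```cpp
#include <cmath>

// Hypothetical helper (illustrative name): wraps x into roughly [-pi, pi).
inline float wrap_pi(float x)
{
    constexpr float pi = 3.14159265358979f;
    constexpr float two_pi = 6.28318530717959f;
    // std::fmod keeps the sign of its first argument,
    // so negative remainders are shifted back into range.
    float r = std::fmod(x + pi, two_pi);
    if (r < 0.0f)
        r += two_pi;
    return r - pi;
}

// Parabola-based sine approximation (assumed technique, see lead-in).
inline float fast_sin(float x)
{
    constexpr float pi = 3.14159265358979f;
    constexpr float B = 4.0f / pi;
    constexpr float C = -4.0f / (pi * pi);
    constexpr float P = 0.225f; // blend weight for the refinement step

    x = wrap_pi(x);
    // First pass: a parabola through (-pi,0), (0,0), (pi,0).
    const float y = B * x + C * x * std::fabs(x);
    // Second pass: blend y with y*|y| to pull the curve closer to sin.
    return P * (y * std::fabs(y) - y) + y;
}
```

If the caller already guarantees the angle is within -pi to pi, the fmod wrap can be dropped, which is where most of the remaining cost sits.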
  8. RobTheBloke

    What is more expensive in float?

    Thanks for the guess tho, it is a cpu standard op, who knows of what advanced math lib, whether intel or amd native ops, I believe they should not differ on this? Sines/cosines are standard maths functions. It is possible to improve on the standard library implementations (depending on how much accuracy you are willing to sacrifice). So yeah, if you can find a way to reduce N cmath calls by one or more, then it's typically a good thing. I would be semi-inclined to make the switch from 4 funcs to 1 + sqrt without bothering to profile. It's highly unlikely that it will be slower (although that code can change between platforms, so an improvement on one platform might not be better on another). *IF* fast math is enabled in your compiler settings, then sqrt is a CPU instruction (if using strict or precise, then typically a standard library function will be used). There are some fairly decent arc-cos / arc-sin approximations around. Certainly the approximations I use aren't substantially worse than the non-inverse functions. Worth reading these: http://forum.devmaster.net/t/fast-and-accurate-sine-cosine/9648 https://www.ecse.rpi.edu/~wrf/Research/Short_Notes/arcsin/onlyelem.html Generally speaking, using floats will be quicker than doubles, but YMMV. (Long topic, so I'll leave that can of worms shut for now.)
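As an example of the kind of arc-cos / arc-sin approximation that's around, here's a sketch using the cubic polynomial given in Abramowitz & Stegun's Handbook of Mathematical Functions. It is not necessarily the approximation I use; the names fast_acos and fast_asin are made up for illustration, the input is assumed to be in [-1, 1], and the absolute error is on the order of 7e-5. It costs one sqrt plus a handful of multiply-adds.

```cpp
#include <cmath>

// Hypothetical helper (illustrative name). Assumes -1 <= x <= 1.
inline float fast_acos(float x)
{
    constexpr float pi = 3.14159265358979f;
    const float ax = std::fabs(x);

    // Cubic polynomial in |x|, evaluated with Horner's scheme.
    float p = -0.0187293f;
    p = p * ax + 0.0742610f;
    p = p * ax - 0.2121144f;
    p = p * ax + 1.5707288f;

    const float r = std::sqrt(1.0f - ax) * p;

    // Use the identity acos(-x) = pi - acos(x) for negative inputs.
    return x < 0.0f ? pi - r : r;
}

// asin follows from asin(x) = pi/2 - acos(x).
inline float fast_asin(float x)
{
    return 1.57079632679490f - fast_acos(x);
}
```

Whether this beats the standard library depends on the platform and compiler settings, so it's still worth a quick profile before committing to it.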
  9. RobTheBloke

    How much Maths/Physics do I have to know?

    It's not really a question of how much maths/physics you need to learn (the more the better, but you have enough to start with). The core topics are matrices, vectors, mechanics (f=ma, describing forces, etc), solving linear and quadratic equations (collision detection), and statistics (compression schemes, AI). The only esoteric thing you may not have come across is a quaternion (essentially a rotation-only matrix, but with a few additional advantages). If you hit most of those, cool. If not, they can be learned. The biggest problem though, is not really how the maths/physics works, but how and when you need to apply calculation X to game feature Y. That can admittedly take some time, but that won't stop you enjoying games development / programming for the sake of it.
  10. Respectfully disagree, strongly. I've taught games programming at university. I've taught games programming at school. I've taught seasoned veterans the nuances of SIMD/threading/mathematics/physics, and all manner of subjects. For some people, spending six months understanding the concepts underlying programming is useful. For the vast majority, it simply removes the thing that makes programming enjoyable: overcoming challenges in the way of *your goals*. The greatest challenge facing any recruiter in games or film VFX right now is that for every million graduates who know what a pointer is, only 5000 have spent their time building software (rather than learning languages), and of those 5000, only 100 have a deep enough understanding to be able to jump right into a programming role. You can read a C++ book cover to cover. You can read the Intel intrinsics guide and optimisation manuals. You can read the latest Vulkan/d3d specs. You can do all of that, and you will have learnt nothing. All of those are vital steps along the way to becoming a professional game developer, but so is experience. So is learning from mistakes. So is biting off more than you can chew, and battling your way through the pain to a workable solution. Mistakes and wrong turns are by far the fastest way to learn. Implement something (badly), read that C++ book again, and very rapidly you will learn how this programming stuff fits together. Be cautious, choose 'simple' exercises, and you'll find that your learning cannot keep pace with the technology.
  11. RobTheBloke

    What's wrong with this snake game code?

    That's a bit dogmatic. I hear people say the same thing of functional programming. Just write code that is simple, stupid, easy to understand, and works. OOP is a useful tool, but not a one-size-fits-all solution.
  12. The best game to start with is one that is simple enough to complete, but complex enough to be challenging. Snake fits that bill. If you only ever look for simple challenges, you won't progress beyond hello world. The joy of programming is finally solving that problem that has had you stumped for 3 days.
  13. Option 3) If your wheel needs to revolve once in 60 frames, create a rotation matrix for 360/60 degrees, and multiply that matrix with your wheel's transformation matrix each frame. It avoids the need for per-frame calls to sin/cos. Really though, stop worrying. If you're starting out in games programming, the first aim should be to get *something* working. Later on you can discover ways to improve performance. Even later still, you can discover the best ways to exploit hardware performance to the full. As a rule of thumb, always choose the simple option. Is it going to be easier to rotate 400 vertices, or modify one transformation matrix? Code optimisation is the art of always choosing the simplest option. Simpler code = faster code (generally speaking).
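A minimal sketch of that incremental rotation, assuming a bare 2x2 matrix for a 2D wheel. The Mat2 type and the function names are made up for illustration; a real game would use whatever matrix type its maths library provides.

```cpp
#include <cmath>

// Hypothetical minimal 2x2 matrix, row-major (illustrative only).
struct Mat2
{
    float m00, m01;
    float m10, m11;
};

inline Mat2 multiply(const Mat2& a, const Mat2& b)
{
    return { a.m00 * b.m00 + a.m01 * b.m10, a.m00 * b.m01 + a.m01 * b.m11,
             a.m10 * b.m00 + a.m11 * b.m10, a.m10 * b.m01 + a.m11 * b.m11 };
}

// Build the per-frame step once. This is the only place sin/cos are called.
inline Mat2 make_rotation_step(int frames_per_revolution)
{
    const float angle = 6.28318530717959f / static_cast<float>(frames_per_revolution);
    const float c = std::cos(angle);
    const float s = std::sin(angle);
    return { c, -s, s, c };
}

// Per-frame update: a single matrix multiply, no trig.
inline void advance(Mat2& wheel, const Mat2& step)
{
    wheel = multiply(step, wheel);
}
```

Usage is just `Mat2 step = make_rotation_step(60);` at start-up, then `advance(wheel, step);` once per frame. One caveat: floating point error accumulates with repeated multiplication, so a long-running wheel may need its matrix re-orthonormalised (or rebuilt from an angle) every so often.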
  14. The simpler workaround is to use optimised release builds in a profiler when doing performance testing.
  15. RobTheBloke

    Windows 10 - OpenGL Version

    That's a slight misrepresentation of the past there! It wasn't entirely a FUD storm. OpenGL was going to end up being relegated to a second-class citizen, but not for the reasons you state. The original plan was that when OpenGL was run, the Aero theme would switch off (no transparency on the desktop). If you wanted your app to be compatible with Aero, you needed to use d3d (or the OpenGL 1.4 -> d3d wrapper). This was the behaviour in the first Vista release candidates (I actually remember having serious discussions about porting our 3d app to d3d). If you were a game developer using GL then, none of this was a concern (you'd be full screen anyway). For 3d app devs though, this was a pretty big concern back in the day (your competitor, with a d3d viewport, would have a nicer looking GUI than you). Microsoft eventually backed down after Autodesk, Adobe and others pointed out they needed OpenGL for their products.