Performance optimization SSE vector dot/normalize

19 comments, last by Vilem Otte 10 years, 6 months ago

Edit: holy cow, VC2012 is almost twice as fast... problem is, I'm forced to use 2008 for this project.

Here's the disassembly from the new code:


   inline const void Normalize()
    {
00413480  push        ebx  
00413481  mov         ebx,esp 
00413483  sub         esp,8 
00413486  and         esp,0FFFFFFF0h 
00413489  add         esp,4 
0041348C  push        ebp  
0041348D  mov         ebp,dword ptr [ebx+4] 
00413490  mov         dword ptr [esp+4],ebp 
00413494  mov         ebp,esp 
00413496  sub         esp,158h 
0041349C  push        esi  
0041349D  push        edi  
0041349E  push        ecx  
0041349F  lea         edi,[ebp-158h] 
004134A5  mov         ecx,56h 
004134AA  mov         eax,0CCCCCCCCh 
004134AF  rep stos    dword ptr es:[edi] 
004134B1  pop         ecx  
004134B2  mov         eax,dword ptr [___security_cookie (42F0D0h)] 
004134B7  xor         eax,ebp 
004134B9  mov         dword ptr [ebp-4],eax 
004134BC  mov         dword ptr [ebp-0Ch],ecx 
        // 0x7F = 0111 1111b: multiply the x/y/z components (not w)
        // and broadcast the result to all 4 components
        __m128 dp = _mm_dp_ps(v, v, 0x7F); 
004134BF  mov         eax,dword ptr [ebp-0Ch] 
004134C2  movaps      xmm0,xmmword ptr [eax] 
004134C5  mov         ecx,dword ptr [ebp-0Ch] 
004134C8  movaps      xmm1,xmmword ptr [ecx] 
004134CB  dpps        xmm1,xmm0,7Fh 
004134D1  movaps      xmmword ptr [ebp-150h],xmm1 
004134D8  movaps      xmm0,xmmword ptr [ebp-150h] 
004134DF  movaps      xmmword ptr [ebp-30h],xmm0 

        // compute rsqrt of the dot product
        dp = _mm_rsqrt_ps(dp);
004134E3  rsqrtps     xmm0,xmmword ptr [ebp-30h] 
004134E7  movaps      xmmword ptr [ebp-130h],xmm0 
004134EE  movaps      xmm0,xmmword ptr [ebp-130h] 
004134F5  movaps      xmmword ptr [ebp-30h],xmm0 

        // vec * rsqrt(dot(vec, vec))
        v = _mm_mul_ps(v, dp);
004134F9  movaps      xmm0,xmmword ptr [ebp-30h] 
004134FD  mov         eax,dword ptr [ebp-0Ch] 
00413500  movaps      xmm1,xmmword ptr [eax] 
00413503  mulps       xmm1,xmm0 
00413506  movaps      xmmword ptr [ebp-110h],xmm1 
0041350D  mov         ecx,dword ptr [ebp-0Ch] 
00413510  movaps      xmm0,xmmword ptr [ebp-110h] 
00413517  movaps      xmmword ptr [ecx],xmm0 
	  }
0041351A  pop         edi  
0041351B  pop         esi  
0041351C  mov         ecx,dword ptr [ebp-4] 
0041351F  xor         ecx,ebp 
00413521  call        @ILT+140(@__security_check_cookie@4) (411091h) 
00413526  mov         esp,ebp 
00413528  pop         ebp  
00413529  mov         esp,ebx 
0041352B  pop         ebx  
0041352C  ret              
Eef, there is bad, and then there is *that* holy disaster. VC is completely destroying the function, moving and flushing the data multiple times for no apparent reason. Are you sure you're patched all the way up, with the processor pack and all that? It was bad, sure, but that is absolutely pitiful. I'll post my assembly output for the same thing in just a minute, as soon as I get a release build with symbols.

Here's the VC2012 codegen:

00B81A06 movaps xmm1,xmmword ptr [__xmm@0000000040400000400000003f800000 (0BDF4C0h)]
00B81A0D movaps xmm0,xmm1
00B81A10 dpps xmm0,xmm1,77h
00B81A16 sqrtps xmm0,xmm0
00B81A19 movss xmm2,dword ptr [__real@3c23d70a (0BDF384h)]
00B81A21 divps xmm1,xmm0

Just a bit more reasonable. :)

That misses the store back to the wrapper, because my code passes by register, and this is from a unit test, so I had to extract the code from the surrounding setup/teardown. Oh, and it uses the full sqrt and div because the calling code needs a bit more accuracy.
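On the accuracy point: a common middle ground between raw `rsqrtps` (~12 bits) and a full `sqrtps` + `divps` is one Newton-Raphson refinement step, which brings the estimate close to full single precision for a few extra multiplies. A sketch, SSE2-only so it avoids `dpps` (names and struct layout are made up for illustration):

```cpp
#include <emmintrin.h>  // SSE2

struct Vec4sse { __m128 v; };

// Normalize xyz using rsqrtps plus one Newton-Raphson step:
// y1 = 0.5 * y0 * (3 - d * y0 * y0), where y0 = rsqrtps(d).
inline void NormalizeNR(Vec4sse& a)
{
    __m128 sq = _mm_mul_ps(a.v, a.v);                 // x^2 y^2 z^2 w^2
    // zero the w lane so it doesn't contribute to the dot product
    sq = _mm_and_ps(sq, _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1)));
    // horizontal add: broadcast x^2 + y^2 + z^2 to all four lanes
    __m128 t = _mm_add_ps(sq, _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 3, 0, 1)));
    __m128 d = _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));

    __m128 y0 = _mm_rsqrt_ps(d);                      // ~12-bit estimate
    __m128 y1 = _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), y0),
                           _mm_sub_ps(_mm_set1_ps(3.0f),
                                      _mm_mul_ps(d, _mm_mul_ps(y0, y0))));
    // w is scaled by the same factor, like the 0x7F dpps version
    a.v = _mm_mul_ps(a.v, y1);
}
```

Whether the extra multiplies beat `sqrtps` + `divps` depends on the CPU, so it's worth measuring both.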

Just installed SP1 -> still the same...

Unfortunately then, given that codegen, you might want to go back to the FPU; it likely couldn't be any slower... :(
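For comparison, the "back to the FPU" path is just plain scalar code, something like this (a sketch; the struct layout is assumed):

```cpp
#include <cmath>

struct Vec4f { float x, y, z, w; };

// Plain scalar normalize of the xyz part; the compiler emits
// ordinary scalar x87/SSE code for this.
inline void NormalizeScalar(Vec4f& v)
{
    float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    if (len > 0.0f)
    {
        float inv = 1.0f / len;
        v.x *= inv;
        v.y *= inv;
        v.z *= inv;
    }
}
```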

Though, I suppose you could try actual inline assembly to avoid the codegen. Hmm....

Well, the most recent code I've posted here is now a little bit faster than the first one up there, so it's already a small improvement.

With VC2008, go into your project properties -> C/C++, and set

Enable Enhanced Instruction Set to Streaming SIMD Extensions 2 (/arch:SSE2)

and

Floating Point Model to Fast (/fp:fast)

I've found that it produces much less ridiculous float code with these options compared to the defaults.
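On the command line, those project settings correspond to something like the following (filename made up for illustration):

```
cl /O2 /arch:SSE2 /fp:fast /c vector.cpp
```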

Already had /fp:fast, but yeah, it got a little faster now (almost 15 percent faster than before). Thanks :)

http://www.agner.org/optimize/


This looks awfully like a debug build.

(security cookie + load/stores on stack for every single SSE intrinsic)

Make sure to thoroughly check your compile settings. The Debug/Release switch won't necessarily give you the results you'd expect if you (or someone else on your team) has disabled the relevant optimizations in the Release build's settings.

This topic is closed to new replies.
