
Performance optimization SSE vector dot/normalize


20 replies to this topic

#1 lipsryme   Members   -  Reputation: 986


Posted 20 September 2013 - 09:26 AM

Hey guys, I've done a "quick" first implementation of a vector normalize and dot product using SSE intrinsics, and was wondering if there's anything that could still be optimized further.

 

Here's my code:

#include <smmintrin.h> // SSE4.1, needed for _mm_dp_ps

__declspec(align(16))
struct Vec4
{
	float x, y, z, w;

	inline const Vec4 Normalize()
	{
		__m128 tmp;

		// copy data into the 128bit register
		tmp = _mm_set_ps(w, z, y, x);

		// 0x7F: high nibble 0111 -> multiply only x, y, z (skip w);
		// low nibble 1111 -> broadcast the sum to all 4 components
		__m128 dp = _mm_dp_ps(tmp, tmp, 0x7F); 
		
		// compute rsqrt of the dot product
		dp = _mm_rsqrt_ps(dp);

		// vec * rsqrt(dot(vec, vec))
		tmp = _mm_mul_ps(tmp, dp);
		
		Vec4 vec;
		union {__m128 v; float f[4]; } uf; // to access the 4 floats
		uf.v = tmp;

		vec.x = uf.f[0];
		vec.y = uf.f[1];
		vec.z = uf.f[2];
		vec.w = 1.0f;

		return vec;
	}

	inline const float Dot(const Vec4 &v2)
	{
		__m128 a;
		
		// copy data into the 128bit register
		a = _mm_set_ps(w, z, y, x);
		__m128 b = _mm_set_ps(v2.w, v2.z, v2.y, v2.x);

		// 0x7F: high nibble 0111 -> multiply only x, y, z (skip w);
		// low nibble 1111 -> broadcast the sum to all 4 components
		__m128 dp = _mm_dp_ps(a, b, 0x7F); 
		
		Vec4 vec;
		union {__m128 v; float f[4]; } uf; // to access the 4 floats
		uf.v = dp;

		return uf.f[0];
	}
};

Thanks in advance :)




#2 Hodgman   Moderators   -  Reputation: 27761


Posted 20 September 2013 - 10:59 AM

I don't know if this is the case here, but I've often seen that if you've got lots of functions like this that move data back and forth between float registers and SSE registers, the transfers end up wasting so much time that SSE becomes the same speed as (or slower than) the regular float code!

I'd try to make your Vec4 class store its members as an __m128.

In some engines I've even seen a floatInVec class (or Vec1), which internally is an __m128, but to the user it appears like a float. This lets you keep more of your math code working with SSE data, rather than having it juggle between float and SSE.
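A minimal sketch of what such a floatInVec wrapper might look like (the exact name and interface here are my assumption, not any specific engine's):

```cpp
#include <xmmintrin.h> // SSE

// Hypothetical floatInVec: a scalar kept in an SSE register, so chained
// math never has to round-trip through the scalar float path.
class floatInVec
{
    __m128 v; // the scalar value lives in lane 0 (other lanes unspecified)
public:
    explicit floatInVec(float f)  : v(_mm_set_ss(f)) {}
    explicit floatInVec(__m128 s) : v(s) {}

    __m128 Get128() const { return v; }

    // Convert back to a plain float only at the very end of a calculation.
    operator float() const { float f; _mm_store_ss(&f, v); return f; }
};
```

Arithmetic operators would then be implemented with _mm_add_ss / _mm_mul_ss and the like, so intermediate results stay in SSE registers.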

 

 

 

Regarding the Normalize and Dot functions - considering that you're only really operating on a Vec3, then if you're running these over large data-sets, it might be worthwhile making an SoAVec3 class (as well as the normal Vec4 class), which stores 4 x values, then 4 y values, then 4 z values in 3 __m128 variables. This reduces the total working size of the data-set, and often lets you write more optimized (shorter) functions.



#3 achild   Crossbones+   -  Reputation: 1596


Posted 20 September 2013 - 11:15 AM

[deleted - stupid post suggesting the use of the dot product intrinsic when it is clearly already being used]


Edited by achild, 20 September 2013 - 11:31 AM.


#4 lipsryme   Members   -  Reputation: 986


Posted 20 September 2013 - 11:25 AM

@achild As you can see above, I'm already using that intrinsic for the dot product. :rolleyes:

@hodgman I was thinking about that... I'll give it a try and see if it gives me some boost.



#5 achild   Crossbones+   -  Reputation: 1596


Posted 20 September 2013 - 11:29 AM

> @achild As you can see above, I'm already using that intrinsic for the dot product. :rolleyes:
>
> @hodgman I was thinking about that... I'll give it a try and see if it gives me some boost.

Wow. Not much you can say to that. Talk about not paying attention.



#6 AllEightUp   Moderators   -  Reputation: 4121


Posted 20 September 2013 - 01:40 PM

As Hodge says, due to the load and then store costs, the code is likely losing so much performance that it may not be giving you any gains at all. Having said that, get rid of the union bits and replace them with the appropriate SSE calls. The union trick forces the compiler to flush the registers back to memory just so you can read the values and write them somewhere else. Here's the modified code (comments removed for brevity):
 
#include <smmintrin.h> // SSE4.1, needed for _mm_dp_ps

__declspec(align(16))
struct Vec4
{
	float x, y, z, w;

	inline const Vec4 Normalize()
	{
		__m128 tmp = _mm_load_ps( &x );
		__m128 dp = _mm_dp_ps(tmp, tmp, 0x7F); 
		dp = _mm_rsqrt_ps(dp);
		tmp = _mm_mul_ps(tmp, dp);
		
		Vec4 vec;
		_mm_store_ps( &vec.x, tmp );
		vec.w = 1.0f;
		return vec;
	}

	inline const float Dot(const Vec4 &v2)
	{
		__m128 a = _mm_load_ps( &x );
		__m128 b = _mm_load_ps( &v2.x );
		__m128 dp = _mm_dp_ps(a, b, 0x7F); 

		float result;
		_mm_store_ss( &result, dp );
		return result;
	}
};
That should be a bit faster, since it removes the unneeded register flushes and leverages aligned loads, given that the class is 16-byte aligned. In effect, even though you are not using an __m128 for storage in the class, this treats the class as one anyway.

NOTE: these two functions are also full of wait states, due to the latencies of the operations being performed. If you are doing batches of normalizations/dot products, running 2 or 4 at a time, interleaved, will effectively triple the throughput. Given SSE4, you don't actually need the SoA data reorg Hodge suggests; you just need to keep more than one in flight at a time.

Edited by AllEightUp, 20 September 2013 - 01:52 PM.


#7 lipsryme   Members   -  Reputation: 986


Posted 20 September 2013 - 02:09 PM

I just tried your updated code, AllEightUp, and it actually got slower.


Edited by lipsryme, 20 September 2013 - 02:11 PM.


#8 AllEightUp   Moderators   -  Reputation: 4121


Posted 20 September 2013 - 02:16 PM

> I just tried your updated code, AllEightUp, and it actually got slower.

Erm, hmmmm... That doesn't seem possible unless the compiler is making a mess, which also seems unlikely. That code *should* be as close to the fewest cycles as possible without your vector class itself containing an __m128, so you can pass by register instead of by reference. I'll have to play with it myself a bit and see if I missed something. What compiler are you using to test the code? And, obviously, is it a release build? :)

#9 lipsryme   Members   -  Reputation: 986


Posted 20 September 2013 - 02:22 PM

Yeah, release build, and the compiler is MSVC (Visual Studio 2008).

 

 

edit: Doing it like this makes it a little bit faster than before:

// 4-component vector class using SIMD instructions
#include <smmintrin.h> // SSE4.1, needed for _mm_dp_ps

__declspec(align(16))
struct Vec4
{
    __m128 v;

    Vec4()
    {

    }

    Vec4(float x, float y, float z, float w)
    {
        v = _mm_set_ps(w, z, y, x);
    }

    inline const float X()
    {
        union {__m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = v;
        return uf.f[0];
    }

    inline const float Y()
    {
        union {__m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = v;
        return uf.f[1];
    }

    inline const float Z()
    {
        union {__m128 v; float f[4]; } uf; // to access the 4 floats
        uf.v = v;
        return uf.f[2];
    }

    inline const void Normalize()
    {
        // 0x7F: high nibble 0111 -> multiply only x, y, z (skip w);
        // low nibble 1111 -> broadcast the sum to all 4 components
        __m128 dp = _mm_dp_ps(v, v, 0x7F); 

        // compute rsqrt of the dot product
        dp = _mm_rsqrt_ps(dp);

        // vec * rsqrt(dot(vec, vec))
        v = _mm_mul_ps(v, dp);
    }

    inline const float Dot(const Vec4 &v2) const
    {
        // 0x7F: high nibble 0111 -> multiply only x, y, z (skip w);
        // low nibble 1111 -> broadcast the sum to all 4 components
        __m128 dp = _mm_dp_ps(v, v2.v, 0x7F); 

        float result;
        _mm_store_ss(&result, dp);

        return result;
    }
};

Edited by lipsryme, 20 September 2013 - 02:48 PM.


#10 AllEightUp   Moderators   -  Reputation: 4121


Posted 20 September 2013 - 02:34 PM

Oh... VC2008 was exceptionally bad with SIMD; it could be doing some really stupid stuff behind the scenes. 2010 and 2012 have massively improved SIMD handling. You might set a breakpoint on the function, switch to disassembly view, and see if it is doing anything obviously stupid. Drop the disassembly here and I can take a peek at it too.

#11 lipsryme   Members   -  Reputation: 986


Posted 20 September 2013 - 02:54 PM

edit: holy cow, VC2012 is almost twice as fast... problem is, I'm forced to use 2008 for this project.

 

Here's the disassembly from the new code:

   inline const void Normalize()
    {
00413480  push        ebx  
00413481  mov         ebx,esp 
00413483  sub         esp,8 
00413486  and         esp,0FFFFFFF0h 
00413489  add         esp,4 
0041348C  push        ebp  
0041348D  mov         ebp,dword ptr [ebx+4] 
00413490  mov         dword ptr [esp+4],ebp 
00413494  mov         ebp,esp 
00413496  sub         esp,158h 
0041349C  push        esi  
0041349D  push        edi  
0041349E  push        ecx  
0041349F  lea         edi,[ebp-158h] 
004134A5  mov         ecx,56h 
004134AA  mov         eax,0CCCCCCCCh 
004134AF  rep stos    dword ptr es:[edi] 
004134B1  pop         ecx  
004134B2  mov         eax,dword ptr [___security_cookie (42F0D0h)] 
004134B7  xor         eax,ebp 
004134B9  mov         dword ptr [ebp-4],eax 
004134BC  mov         dword ptr [ebp-0Ch],ecx 
        // 0x7F = 0111 1111 ~ means we don't want the w-component multiplied
        // and the result written to all 4 components
        __m128 dp = _mm_dp_ps(v, v, 0x7F); 
004134BF  mov         eax,dword ptr [ebp-0Ch] 
004134C2  movaps      xmm0,xmmword ptr [eax] 
004134C5  mov         ecx,dword ptr [ebp-0Ch] 
004134C8  movaps      xmm1,xmmword ptr [ecx] 
004134CB  dpps        xmm1,xmm0,7Fh 
004134D1  movaps      xmmword ptr [ebp-150h],xmm1 
004134D8  movaps      xmm0,xmmword ptr [ebp-150h] 
004134DF  movaps      xmmword ptr [ebp-30h],xmm0 

        // compute rsqrt of the dot product
        dp = _mm_rsqrt_ps(dp);
004134E3  rsqrtps     xmm0,xmmword ptr [ebp-30h] 
004134E7  movaps      xmmword ptr [ebp-130h],xmm0 
004134EE  movaps      xmm0,xmmword ptr [ebp-130h] 
004134F5  movaps      xmmword ptr [ebp-30h],xmm0 

        // vec * rsqrt(dot(vec, vec))
        v = _mm_mul_ps(v, dp);
004134F9  movaps      xmm0,xmmword ptr [ebp-30h] 
004134FD  mov         eax,dword ptr [ebp-0Ch] 
00413500  movaps      xmm1,xmmword ptr [eax] 
00413503  mulps       xmm1,xmm0 
00413506  movaps      xmmword ptr [ebp-110h],xmm1 
0041350D  mov         ecx,dword ptr [ebp-0Ch] 
00413510  movaps      xmm0,xmmword ptr [ebp-110h] 
00413517  movaps      xmmword ptr [ecx],xmm0 
	  }
0041351A  pop         edi  
0041351B  pop         esi  
0041351C  mov         ecx,dword ptr [ebp-4] 
0041351F  xor         ecx,ebp 
00413521  call        @ILT+140(@__security_check_cookie@4) (411091h) 
00413526  mov         esp,ebp 
00413528  pop         ebp  
00413529  mov         esp,ebx 
0041352B  pop         ebx  
0041352C  ret              

Edited by lipsryme, 20 September 2013 - 02:58 PM.


#12 AllEightUp   Moderators   -  Reputation: 4121


Posted 20 September 2013 - 03:07 PM

Eef, there is bad, and then there is *that* holy disaster. VC is completely destroying the function, moving and flushing the data multiple times for no apparent reason. Are you sure you are patched all the way up, with the processor pack and all that? It was bad, sure, but that is absolutely pitiful. I'll post my assembly output for the same thing in just a minute, as soon as I get a release build with symbols.

Here's the VC2012 codegen:

00B81A06 movaps xmm1,xmmword ptr [__xmm@0000000040400000400000003f800000 (0BDF4C0h)]
00B81A0D movaps xmm0,xmm1
00B81A10 dpps xmm0,xmm1,77h
00B81A16 sqrtps xmm0,xmm0
00B81A19 movss xmm2,dword ptr [__real@3c23d70a (0BDF384h)]
00B81A21 divps xmm1,xmm0

Just a bit more reasonable. :)

That misses the store back to the wrapper, because my code passes by register, and this is in a unit test, so I had to extract the code from the surrounding setup/teardown of the test. Oh, and it uses the full sqrt and div because the calling code needs a bit more accuracy.

Edited by AllEightUp, 20 September 2013 - 03:24 PM.


#13 lipsryme   Members   -  Reputation: 986


Posted 20 September 2013 - 03:49 PM

Just installed SP1 -> still the same...



#14 AllEightUp   Moderators   -  Reputation: 4121


Posted 20 September 2013 - 03:57 PM

Unfortunately then, given that codegen, you might want to go back to the FPU; it likely couldn't be any slower... :(

Though, I suppose you could try actual inline assembly to avoid the codegen. Hmm....

Edited by AllEightUp, 20 September 2013 - 04:03 PM.


#15 lipsryme   Members   -  Reputation: 986


Posted 20 September 2013 - 04:04 PM

Well, the most recent code I've posted here is now a little bit faster than the first version up there, so it's already a small improvement.


Edited by lipsryme, 20 September 2013 - 04:07 PM.


#16 Hodgman   Moderators   -  Reputation: 27761


Posted 20 September 2013 - 07:54 PM

With VC2008, go into your project properties -> C/C++, and set

Enable Enhanced Instruction Set to Streaming SIMD Extensions 2 (/arch:SSE2)

and

Floating Point Model to Fast (/fp:fast)

 

I've found that it produces much less ridiculous float code with these options than with the defaults.
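For reference, the command-line equivalents look like this (the file name is just a placeholder; /O2 is the usual Release optimization level, which must also be on for the other flags to matter much):

```shell
cl /O2 /arch:SSE2 /fp:fast /c Vec4.cpp
```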



#17 lipsryme   Members   -  Reputation: 986


Posted 21 September 2013 - 03:39 AM

Already had /fp:fast, but yeah, it got a little faster now (almost 15 percent faster than before). Thanks :)


Edited by lipsryme, 21 September 2013 - 03:39 AM.


#18 DT....   Members   -  Reputation: 487


Posted 22 September 2013 - 02:45 AM

http://www.agner.org/optimize/



#19 momotte   Members   -  Reputation: 171


Posted 09 October 2013 - 11:14 AM

 

> edit: holy cow VC2012 is almost twice as fast...problem is I'm forced to use 2008 for this project.
>
> Here's the disassembly from the new code:
> [disassembly quoted from post #11, snipped]

This looks awfully like a debug build.

(security cookie + load/stores on stack for every single SSE intrinsic)



#20 Nypyren   Crossbones+   -  Reputation: 3716


Posted 09 October 2013 - 01:46 PM

Make sure to thoroughly check your compiler settings. The Debug/Release switch won't necessarily give you the results you'd expect if you (or someone else on your team) have disabled the relevant optimizations in the Release build's settings.





