About intrinsics

I'm trying to learn SSE intrinsics, and I've written a dot product function with the following code:

		inline float Dot(const CVector4 &vVector)
		{
			__m128 vec1;
			__m128 vec2;

			vec1 = _mm_mul_ps(this->v, vVector.v);                   // (p0, p1, p2, p3): componentwise products
			vec2 = _mm_movehl_ps(vec1, vec1);                        // lane 0 = p2
			vec2 = _mm_add_ss(vec2, vec1);                           // lane 0 = p0 + p2
			vec1 = _mm_shuffle_ps(vec1, vec1, _MM_SHUFFLE(0,0,0,1)); // lane 0 = p1 (mask 1, matching the shufps in the listing below)
			vec1 = _mm_add_ss(vec1, vec2);                           // lane 0 = p0 + p1 + p2; note p3 (the w product) is never added
			return vec1.m128_f32[0];
		}
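For reference, the post never shows CVector4 itself; for the intrinsics above to compile, something along these lines is implied. This is a guess at the layout, not the actual class:

		// Hypothetical sketch of CVector4, inferred from the snippet above.
		// Requires <xmmintrin.h>. The __m128 member makes instances 16-byte
		// aligned, which the aligned movaps loads in the listing below rely on.
		class CVector4
		{
		public:
			__m128 v;	// x, y, z, w packed into one SSE register-width field

			inline float Dot(const CVector4 &vVector);	// as defined above
		};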
Now when I look at the assembly code generated from the code above, it seems that the compiler (VC++ 2005) generates unneeded movaps instructions:

	mov	eax, DWORD PTR _vVector$[ebp]
	movaps	xmm1, XMMWORD PTR [ecx]
	movaps	xmm0, XMMWORD PTR [eax]
	mulps	xmm0, xmm1
	movaps	xmm1, xmm0 <--- ???
	movhlps	xmm1, xmm0 
	movaps	xmm2, xmm0 <--- ???
	addss	xmm1, xmm0
	shufps	xmm2, xmm0, 1 
	addss	xmm2, xmm1
	movaps	XMMWORD PTR _vec1$[esp+16], xmm2
	movss	xmm0, DWORD PTR _vec1$[esp+16]
Can I modify the code so that the compiler doesn't add these unnecessary instructions, or is there no way around it? I'd rather not use inline assembly, since the compiler usually adds a bunch of code before and after the assembly block, which makes the generated code a lot slower. Any suggestions?
If you want to use assembly and have it inlined with no setup code, you can use:

__declspec(naked) inline float Dot(const CVector4 &vVector)
{
	__asm
	{
		...
	}
}


You lose some of the optimizer's ability to analyze the code, but compiler intrinsics aren't generally seen as all that optimal anyway. If you're starting to use SSE intrinsics, you've probably analyzed the performance benefit and current cost enough to warrant hand-optimizing that section.
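The body is elided above; a hypothetical fleshed-out version, assuming Dot is hoisted out of the class into a free __cdecl function taking both vectors (all names here are made up), might look like:

__declspec(naked) float Dot(const CVector4 &a, const CVector4 &b)
{
	__asm
	{
		mov     eax, [esp + 4]        ; &a (naked: we manage the frame ourselves)
		mov     ecx, [esp + 8]        ; &b
		movaps  xmm0, [eax]
		mulps   xmm0, [ecx]           ; (p0, p1, p2, p3)
		movhlps xmm1, xmm0            ; lane 0 = p2
		addss   xmm1, xmm0            ; lane 0 = p0 + p2
		shufps  xmm0, xmm0, 1         ; lane 0 = p1
		addss   xmm0, xmm1            ; lane 0 = p0 + p1 + p2
		movss   [esp + 4], xmm0       ; x86 returns floats in st(0), so bounce
		fld     dword ptr [esp + 4]   ; the result through memory onto the FPU stack
		ret
	}
}

Note that the x86 float return convention forces a store/reload through st(0) at the end, which is the same kind of memory round trip the intrinsic version is being criticized for.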
Ah, got it, thanks for the tip. I'll try moving the member functions out of the class and using __declspec(naked).
This defeats one of the intrinsics' advantages, though: they can also be compiled for x64 and IA-64, where inline assembly isn't supported.
Are you compiling in debug or release mode?
Also, don't store your results back into vars you're using already.

For example, try changing this:
vec2 = _mm_add_ss(vec2,vec1);
to
__m128 vec3 = _mm_add_ss(vec2,vec1);

Aliasing is evil, especially in cases where the compiler might be a bit shaky to begin with (such as SIMD code). So even simple stuff like the above might help.

I'd definitely avoid inline asm and stick with the intrinsics. (For now, at least. If you profile it and find it's still too slow, then write an inline ASM version, profile that, and if it turns out a lot faster, go with it.)
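Applying that suggestion to the whole function, a no-aliasing version of the body might read like this (same operations as the original, just a fresh variable for each result; comments track lane 0):

	inline float Dot(const CVector4 &vVector)
	{
		__m128 prod  = _mm_mul_ps(this->v, vVector.v);                   // (p0, p1, p2, p3)
		__m128 high  = _mm_movehl_ps(prod, prod);                        // lane 0 = p2
		__m128 sum02 = _mm_add_ss(high, prod);                           // lane 0 = p0 + p2
		__m128 p1    = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(0,0,0,1)); // lane 0 = p1
		__m128 dot   = _mm_add_ss(p1, sum02);                            // lane 0 = p0 + p1 + p2
		return dot.m128_f32[0];
	}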
Quote:Original post by Spoonbender
Are you compiling in debug or release mode?
Also, don't store your results back into vars you're using already.


I'm compiling in release mode. I tried assigning to different vars, but it still produces the same code. I'm not really using this code for anything yet, since I'm just trying to learn SSE intrinsics, but I was curious why the compiler generates the extra movaps instructions. Do other compilers have this problem too?
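As an aside on the tail of that listing: the movaps to _vec1$ followed by the movss reload comes from reading m128_f32[0], which forces the vector out to memory. On compilers that provide the _mm_cvtss_f32 intrinsic (GCC does; later MSVC versions do as well), lane 0 can be extracted directly, along these lines:

	// Sketch, assuming _mm_cvtss_f32 is available (declared in xmmintrin.h):
	// extracts lane 0 without a round trip through memory.
	return _mm_cvtss_f32(vec1);	// instead of: return vec1.m128_f32[0];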
