SSE problem!

Started by
1 comment, last by Andybean9 18 years, 10 months ago
Hi everyone, I am having some trouble understanding why my SSE instructions are almost 16 times slower than my regular function. I am using SSE in computing the magnitude of a vector. The code looks like this:

			inline float GetLength()
			{
				float f;

				// If we don't have SSE support do it the slow crappy way
				if(!g_bSSE)
					f = (float)sqrt(m_fX*m_fX + m_fY*m_fY + m_fZ*m_fZ);
				else
				{
					float *pf = &f;
					m_fW = 0.0f;

					__asm
					{
						mov		ecx, pf			; ecx = address of the result variable f
						mov		esi, this		; esi = address of this vector
						movups	xmm0, [esi]		; Load x, y, z, w into xmm0 (unaligned load)
						mulps	xmm0, xmm0		; Multiply the register
						movaps	xmm1, xmm0		; Copy the result
						shufps	xmm1, xmm1, 4Eh	; Shuffle: f1, f0, f3, f2
						addps	xmm0, xmm1
						movaps	xmm1, xmm0		; Copy result
						shufps	xmm1, xmm1, 11h
						addps	xmm0, xmm1
						sqrtss	xmm0, xmm0		; Square root
						movss	[ecx], xmm0		; Store the result to f through ecx
					}

					m_fW = 1.0f;
				}
				return f;
			}

When I run it on a processor that supports SSE, that code is extremely slow. When I change the bool so that it uses the regular sqrt function, it goes blazingly fast. Is there something I am missing in using SSE? Thank you for your time and help!

Andrew
If you're going to use SSE, get CodeAnalyst.
Learn to use the 'pipeline simulation mode'.

The problem here is a terrible dependency chain. You keep using the result of each operation as an input to the next one. That result won't be ready for another 3 or 4 cycles, so each instruction stalls on the previous one. Another thing to keep in mind with SSE is that scalar operations are usually less expensive than parallel ones, and shufps is always expensive.

You could rewrite:

    mulps	xmm0, xmm0		; Multiply the register
    movaps	xmm1, xmm0		; Copy the result
    shufps	xmm1, xmm1, 4Eh	; Shuffle: f1, f0, f3, f2
    addps	xmm0, xmm1
    movaps	xmm1, xmm0		; Copy result
    shufps	xmm1, xmm1, 11h
    addps	xmm0, xmm1

as:

    mulps	xmm0, xmm0
    movaps	xmm2, xmm0
    movhlps	xmm1, xmm0
    shufps	xmm2, xmm2, 01010101b	; copy y-component into x-component
    addss	xmm0, xmm1
    addss	xmm2, xmm0

and it will run faster. Note that the sum now ends up in xmm2, so the sqrtss and the final movss have to use xmm2 instead of xmm0. There will certainly still be stalling, but you should always prefer 'movhlps/*ss' to 'shufps/*ps' when all you need is a scalar result. (And write your argument to shufps as a binary number, so that each pair of bits shows which source component is written to each destination component. With hex it is much harder to decipher.)
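For comparison, here is a minimal sketch of the same movhlps/addss reduction written with compiler intrinsics instead of inline assembly. `LengthSSE` is a hypothetical name and the float-pointer interface is an assumption, not the original class layout. Because the w lane never enters the sum, this version does not need to zero and restore m_fW.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cmath>

// Hypothetical helper: length of the (x, y, z) part of a 4-float vector.
// v points at four floats laid out as x, y, z, w; w is ignored.
inline float LengthSSE(const float* v)
{
    assert(v != nullptr);
    __m128 p  = _mm_loadu_ps(v);                 // movups: x, y, z, w
    __m128 sq = _mm_mul_ps(p, p);                // mulps:  x*x, y*y, z*z, w*w
    __m128 zz = _mm_movehl_ps(sq, sq);           // movhlps: lane 0 = z*z
    __m128 yy = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(1, 1, 1, 1)); // lane 0 = y*y
    __m128 s  = _mm_add_ss(sq, zz);              // addss: lane 0 = x*x + z*z
    s = _mm_add_ss(yy, s);                       // addss: lane 0 = y*y + x*x + z*z
    return _mm_cvtss_f32(_mm_sqrt_ss(s));        // sqrtss, then read lane 0
}
```

The compiler picks the registers and may reorder independent instructions, so the generated code will not necessarily match the hand-written sequence line for line.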

If you want to be a good SSE programmer, you need to learn one important lesson: the *ps instructions will do more at once than you can do with scalar ops BUT(!), the latency (i.e. when you can expect the result to be ready) will be twice as bad. So it only benefits you if you can find something worthwhile to do during those latent cycles. In this case, there is nothing... so it will probably be faster to write it with scalar instructions (this is a fast vector normalize I posted in a different thread):

    struct vec3
    {
        float x, y, z;
    };

    void normalize(vec3* v)
    {
        __asm
        {
            mov     eax,  v
            movss   xmm0, [eax]vec3.x
            movss   xmm1, [eax]vec3.y
            movss   xmm2, [eax]vec3.z
            mulss   xmm0, xmm0
            mulss   xmm1, xmm1
            mulss   xmm2, xmm2
            addss   xmm0, xmm1
            addss   xmm2, xmm0
            movss   xmm4, [eax]vec3.x
            movss   xmm5, [eax]vec3.y
            movss   xmm6, [eax]vec3.z
            rsqrtss xmm2, xmm2
            mulss   xmm4, xmm2
            mulss   xmm5, xmm2
            mulss   xmm6, xmm2
            movss   [eax]vec3.x, xmm4
            movss   [eax]vec3.y, xmm5
            movss   [eax]vec3.z, xmm6
        }
    }
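The scalar normalize above can also be sketched with intrinsics, which is handy where `__asm` blocks are not available (for example, 64-bit MSVC builds). `NormalizeApprox` is a hypothetical name. Keep in mind that rsqrtss returns an approximation of the reciprocal square root (roughly 12 bits of precision), so the result is only approximately unit length.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cmath>

struct vec3 { float x, y, z; };

// Hypothetical helper: scalar-SSE normalize using the rsqrtss approximation.
inline void NormalizeApprox(vec3* v)
{
    assert(v != nullptr);
    __m128 x = _mm_load_ss(&v->x);               // movss: load each component
    __m128 y = _mm_load_ss(&v->y);
    __m128 z = _mm_load_ss(&v->z);

    // Three independent multiplies, then two dependent adds.
    __m128 len2 = _mm_add_ss(_mm_add_ss(_mm_mul_ss(x, x), _mm_mul_ss(y, y)),
                             _mm_mul_ss(z, z));

    __m128 inv = _mm_rsqrt_ss(len2);             // rsqrtss: approx. 1/sqrt(len2)

    _mm_store_ss(&v->x, _mm_mul_ss(x, inv));     // scale and store each component
    _mm_store_ss(&v->y, _mm_mul_ss(y, inv));
    _mm_store_ss(&v->z, _mm_mul_ss(z, inv));
}
```

If full precision matters, `_mm_sqrt_ss` followed by `_mm_div_ss` is the exact (and slower) alternative.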


Another thing to do, and this won't help your current problem but is just a good idea, is to align your vectors on 16-byte boundaries. Most people think the benefit is that this allows you to use movaps instead of movups, but that part actually makes very little difference. SSE is quite an, ahem, "register-challenged" instruction set. When you align vecs on 16 bytes, you can use that vec directly as a memory operand to all *ps instructions without first having to load it into a register. This is a HUGE benefit.
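A minimal sketch of what 16-byte alignment looks like in C++ source, assuming a C++11 compiler (`Vec4`, `IsAligned16` and `AddInPlace` are hypothetical names). With aligned data the compiler is allowed to fold the load of `b` into the memory operand of addps, although whether it actually does so is up to the optimizer.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cstdint>

// alignas(16) makes every Vec4 (including array elements) start on a
// 16-byte boundary, which movaps and *ps memory operands require.
struct alignas(16) Vec4 { float x, y, z, w; };

inline bool IsAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: a += b on all four lanes.
inline void AddInPlace(Vec4& a, const Vec4& b)
{
    assert(IsAligned16(&a) && IsAligned16(&b));
    __m128 va = _mm_load_ps(&a.x);               // aligned load (movaps)
    __m128 vb = _mm_load_ps(&b.x);               // may become a memory operand of addps
    _mm_store_ps(&a.x, _mm_add_ps(va, vb));      // aligned store
}
```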

Another thing about SSE: NEVER do just one thing at a time. Always try to organize your time-critical data and algorithms so that your SSE code can work on more than one set of data at once. You were astounded to find out that SSE Length() is slower than the FPU. Sure, the way you wrote it. But do you understand that because of stalls, you can do 2 Length()s in the same time that you do one? Do you realize that you could do FOUR Length()s in the same time as one? You can.

This should be a very good lesson: rewrite your code, duplicating each instruction 4 times: each one to find the Length() of a different vector, one in xmm0/xmm1, the next in xmm2/xmm3 etc. You should analyze the differences in CodeAnalyst to appreciate why I am advocating this rule.
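As a rough illustration of that exercise, here is a sketch with intrinsics that runs four independent Length() chains side by side, one stage at a time. `Vec4` and `Length4` are hypothetical names, and the vectors are assumed to be 16-byte aligned. The source order only suggests the interleaving; the compiler and the CPU decide the final schedule.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cstdint>

struct alignas(16) Vec4 { float x, y, z, w; };

// Hypothetical helper: lengths of four vectors, computed as four
// independent dependency chains so one chain's latency hides another's.
inline void Length4(const Vec4* v, float* out)
{
    assert((reinterpret_cast<std::uintptr_t>(v) & 15u) == 0);

    // Stage 1: load and square each vector (no chain depends on another).
    __m128 a = _mm_load_ps(&v[0].x);  a = _mm_mul_ps(a, a);
    __m128 b = _mm_load_ps(&v[1].x);  b = _mm_mul_ps(b, b);
    __m128 c = _mm_load_ps(&v[2].x);  c = _mm_mul_ps(c, c);
    __m128 d = _mm_load_ps(&v[3].x);  d = _mm_mul_ps(d, d);

    // Stage 2: lane 0 = x*x + z*z (movhlps + addss).
    __m128 sa = _mm_add_ss(a, _mm_movehl_ps(a, a));
    __m128 sb = _mm_add_ss(b, _mm_movehl_ps(b, b));
    __m128 sc = _mm_add_ss(c, _mm_movehl_ps(c, c));
    __m128 sd = _mm_add_ss(d, _mm_movehl_ps(d, d));

    // Stage 3: add y*y, brought down to lane 0 by a shuffle.
    sa = _mm_add_ss(sa, _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 1, 1, 1)));
    sb = _mm_add_ss(sb, _mm_shuffle_ps(b, b, _MM_SHUFFLE(1, 1, 1, 1)));
    sc = _mm_add_ss(sc, _mm_shuffle_ps(c, c, _MM_SHUFFLE(1, 1, 1, 1)));
    sd = _mm_add_ss(sd, _mm_shuffle_ps(d, d, _MM_SHUFFLE(1, 1, 1, 1)));

    // Stage 4: scalar square roots and stores.
    _mm_store_ss(&out[0], _mm_sqrt_ss(sa));
    _mm_store_ss(&out[1], _mm_sqrt_ss(sb));
    _mm_store_ss(&out[2], _mm_sqrt_ss(sc));
    _mm_store_ss(&out[3], _mm_sqrt_ss(sd));
}
```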

Hmmm, even with not-that-good SSE, a 16x advantage for the FPU doesn't make sense. The FPU uses a real sqrt, which is something like 30 cycles, which makes the whole op take around 40, which would mean the SSE version takes 640 cycles?! No freaking way! As written, the code you posted shouldn't take more than 70 cycles.

Anyway, I'll test the code... still it doesn't make sense. I'm sorry this post is very long, I hope it is worth the time to read.

You should be very careful with CodeAnalyst as there is one annoying bug in the current release with pipeline simulation mode. You select 'simulation', and double-click the source file to analyze, you double-click the line of code, it expands the assembly, you set a start marker... and nothing happens! You then need to go into 'Tools->Project Options' and just click 'okay' without changing anything and magically the 'start simulation' icon becomes available!

It is a very crude toy, but holy crap is it useful.

[edit] Regarding "prefer binary to hex as an argument to shufps": check your comment against the immediate. 4Eh is 01001110b, and reading the bit pairs from the high lane down gives 1, 0, 3, 2, so the comment "f1, f0, f3, f2" (y, x, w, z) holds only if you list the lanes from high to low; listed from lane 0 up, the result is f2, f3, f0, f1. Of course it doesn't affect the algorithm, but you should recognize how much work the hex representation makes you do to confirm that.

If you wrote 'shufps xmm1, xmm1, 01001110b ; Shuffle: f1, f0, f3, f2'
anyone studying this code for an error could check the comment against the bit pairs at a glance. With 4Eh it is much less obvious, to the point where the reader might just accept the comment rather than do the hex conversion in their head. Anyway, I am belaboring an insignificant point. I hope this is at least slightly useful.
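To check a shufps immediate against its comment, it can help to decode it mechanically. A minimal sketch, assuming both operands are the same register and the standard encoding in which bits 1:0 pick the source for lane 0 and bits 7:6 pick the source for lane 3 (`ShuffleSelector` is a hypothetical helper name):

```cpp
#include <cassert>

// Hypothetical helper: for `shufps reg, reg, imm8`, returns which source
// lane (0..3) is copied into destination lane `lane` (0 = lowest).
inline int ShuffleSelector(unsigned imm8, int lane)
{
    assert(lane >= 0 && lane < 4);
    return static_cast<int>((imm8 >> (2 * lane)) & 3u);
}

// Decoded from the high lane (3) down to the low lane (0):
//   4Eh = 01001110b  ->  1, 0, 3, 2   (swaps the upper and lower halves)
//   11h = 00010001b  ->  0, 1, 0, 1
//   55h = 01010101b  ->  1, 1, 1, 1   (broadcasts y)
```

Read this way, 4Eh gives f1, f0, f3, f2 from the high lane down, which is the same thing as f2, f3, f0, f1 counted up from lane 0.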

[Edited by - ajas95 on June 19, 2005 10:33:34 PM]
Hello,
Thank you for your detailed reply! I shuffled my code around a bit to look more like yours and amazingly enough it now runs as fast as it should! I'm not sure what it was about the way I was doing it before that made it hang so badly, but I'm glad it now runs fast. I'm new to SSE programming, so I'm still trying to discover the best ways to do things. By the way, I was using Intel's VTune to do my performance analysis, but I will look into CodeAnalyst as you suggest. Again, thank you! I'm also curious to see if you ran the code I posted and what results you got from it.

Andrew

