Beware the SSE beast...


I've been using a lot of SSE lately to speed up the math-heavy parts of my demos, and I've been blown away by the speed improvements I've been getting. BUT, I do have to say that for scalar ops it really, really sucks. Take, for example, determinants. I wrote a little SSE function to calculate the determinant of the 3x3 matrix formed by vectors v0, v1, and v2 (which is just v0 . (v1 x v2)):
float det(const vec4& v0, const vec4& v1, const vec4& v2)
{
float result;

__asm {
mov	eax,	v0
mov	ecx,	v1
mov	edx,	v2

movaps	xmm3,	[ecx]
movaps	xmm2,	[edx]
movaps	xmm4,	[edx]
movaps	xmm1,	[ecx]
movaps	xmm0,	[eax]			// xmm0 :  ? Az Ay Ax
shufps	xmm3,	xmm3,	11001001b	// xmm3 :  ? Bx Bz By
shufps	xmm4,	xmm4,	11010010b	// xmm4 :  ? Cy Cx Cz
shufps	xmm1,	xmm1,	11010010b	// xmm1 :  ? By Bx Bz
shufps	xmm2,	xmm2,	11001001b	// xmm2 :  ? Cx Cz Cy
mulps	xmm4,	xmm3			// xmm4 =  ? BxCy BzCx ByCz
mulps	xmm2,	xmm1			// xmm2 =  ? ByCx BxCz BzCy
subps	xmm4,	xmm2			// xmm4 = v1 x v2
mulps	xmm4,	xmm0			// now dot v0 . xmm4
movhlps	xmm5,	xmm4			// z product down to the low lane
addss	xmm5,	xmm4			// standard dot product.  The dependency chain here is very bad.
shufps	xmm4,	xmm4,	11000001b	// y product down to the low lane
addss	xmm4,	xmm5			// xmm4 : x + z + y, the full dot
movss	result,	xmm4
}

return result;
}


Not bad. 50 cycles. But then, just for kicks, I decided to go back and write a scalar version to see what would happen if I got rid of all those stupid shufps and movhlps:
float det_scalar(const vec4& v0, const vec4& v1, const vec4& v2)
{
float result;

__asm {
mov	eax,	v0
mov	ecx,	v1
mov	edx,	v2

movss	xmm0,	[ecx]vec4.y
movss	xmm1,	[edx]vec4.z
movss	xmm2,	[ecx]vec4.x
movss	xmm4,	xmm0
movss	xmm5,	xmm2
movss	xmm3,	[edx]vec4.x
movss	xmm7,	[ecx]vec4.z
movss	xmm6,	[edx]vec4.y
mulss	xmm0,	xmm1			// xmm0 : ByCz
mulss	xmm2,	xmm1			// xmm2 : BxCz		: xmm1 free
mulss	xmm4,	xmm3			// xmm4 : ByCx
movss	xmm1,	[ecx]vec4.z		// xmm1 : Bz
mulss	xmm5,	xmm6			// xmm5 : BxCy
mulss	xmm7,	xmm6			// xmm7 : BzCy		: xmm6 free
mulss	xmm1,	xmm3			// xmm1 : BzCx		: xmm3 free
subss	xmm5,	xmm4			// xmm5 : BxCy - ByCx	: xmm4 free
movss	xmm6,	[eax]vec4.x
subss	xmm1,	xmm2			// xmm1 : BzCx - BxCz	 : xmm2 free
subss	xmm0,	xmm7			// xmm0 : ByCz - BzCy	 : xmm7 free
movss	xmm3,	[eax]vec4.z
movss	xmm7,	[eax]vec4.y
mulss	xmm0,	xmm6			// xmm0 : Ax * (ByCz - BzCy)
mulss	xmm3,	xmm5			// xmm3 : Az * (BxCy - ByCx)
mulss	xmm7,	xmm1			// xmm7 : Ay * (BzCx - BxCz)
addss	xmm0,	xmm7			// sum the three terms...
addss	xmm0,	xmm3			// xmm0 : the full determinant
movss	result, xmm0
}

return result;
}


43 cycles! Almost a 15% improvement just by converting the vector ops to scalar. SIMD my ass. Does anyone have a fast FPU determinant? If scalar SSE is so much faster than SIMD SSE, maybe the FPU will be fastest of all?
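For reference, the plain-C baseline the FPU question is about (a minimal sketch, assuming vec4 exposes x/y/z members, as the [ecx]vec4.y addressing above implies); built with SSE codegen off, the compiler turns this into fld/fmul/fsub x87 code:

// v0 . (v1 x v2), fully expanded
float det_fpu(const vec4& v0, const vec4& v1, const vec4& v2)
{
	return v0.x * (v1.y * v2.z - v1.z * v2.y)
	     + v0.y * (v1.z * v2.x - v1.x * v2.z)
	     + v0.z * (v1.x * v2.y - v1.y * v2.x);
}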

FPU should be about the same. SSE isn't good when you have to do too many preparations before the actual math. However, sometimes it does generate some very nice speed improvements.

1) This costs you an AGI stall:

mov eax, v0
mov ecx, v1
mov edx, v2

2) Also, your function will probably not be inlined, hence call/ret overhead, especially if params are passed on the stack.

A way to get rid of all this with Visual: prefer the C Intel intrinsics to asm. Only GCC's inline asm is worth the trouble; calling conventions and fixed register allocation make Visual C++ inline asm worthless unless you code a loop routine that hides the overheads as something marginal.

3) movss result, xmm4
return result

This also costs additional dependency chains, plus traffic on the stack. My way is to use a scalar type (or C++ class) that is not a float; usually it will be a duplicated float (s,s,s,s). This way there is no conversion cost, and if you next want to scale a vector by this determinant, it's immediate (see the sketch after point 4).

4) Last point: it might be possible to schedule the code a bit more.

All in all, you'll see more clearly what SSE can actually do. Still, the main problem is that there are not enough ops to achieve full parallelism: mostly two "threads", the right and left parts of the cross product. And there is a long dependency chain to form the final dot product. That's why FPU or ss SSE might beat ps here.
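To make points 2) and 3) concrete, here is a minimal intrinsics sketch of the same determinant (an illustration, not code from this thread): it needs no inline asm under VCC, and it returns the result splatted as (s,s,s,s), so scaling a vector by the determinant afterwards is a single mulps with no float round trip through the stack. The shuffle masks mirror the asm above.

#include <xmmintrin.h>

// determinant of the 3x3 matrix with rows a, b, c (w lanes ignored),
// returned splatted across all four lanes: (det, det, det, det).
static inline __m128 det3_ps(__m128 a, __m128 b, __m128 c)
{
	// b x c, with the same (yzx)/(zxy) swizzles as the asm version
	__m128 cross = _mm_sub_ps(
		_mm_mul_ps(_mm_shuffle_ps(b, b, _MM_SHUFFLE(3,0,2,1)),	// ? Bx Bz By
		           _mm_shuffle_ps(c, c, _MM_SHUFFLE(3,1,0,2))),	// ? Cy Cx Cz
		_mm_mul_ps(_mm_shuffle_ps(b, b, _MM_SHUFFLE(3,1,0,2)),	// ? By Bx Bz
		           _mm_shuffle_ps(c, c, _MM_SHUFFLE(3,0,2,1))));	// ? Cx Cz Cy
	__m128 p = _mm_mul_ps(cross, a);		// componentwise a * (b x c)
	// horizontal x + y + z: the same dependency chain the thread complains about
	__m128 s = _mm_add_ss(
		_mm_add_ss(p, _mm_shuffle_ps(p, p, _MM_SHUFFLE(1,1,1,1))),
		_mm_movehl_ps(p, p));
	return _mm_shuffle_ps(s, s, _MM_SHUFFLE(0,0,0,0));	// splat (s,s,s,s)
}

Scaling a vector by the determinant is then just _mm_mul_ps(v, det3_ps(a, b, c)).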

Quote:
Original post by Charles B
1) This costs you an AGI stall:
mov eax, v0
mov ecx, v1
mov edx, v2

Hmm, maybe. Is there a better way? I think that in the end, the first 2 movs are dispatched together, and as soon as eax is calculated, the movaps starts. And while the other addresses are generated, v0 is being shufpsed, so it's all parallelized anyway. I think I've gotten too used to writing really bad integer code, since it all gets hidden by the SSE code running in parallel.

Quote:
2) Also, your function will probably not be inlined, hence call/ret overhead, especially if params are passed on the stack.

Sometimes inlined, sometimes not. I never really know unless I look. Many times I return values via non-const reference... I'll have to check if that affects inlining.

Quote:
A way to get rid of all this with Visual: prefer the C Intel intrinsics to asm.

Have you had good luck with intrinsics? I feel like I tried them once and was disappointed by the scheduling/register allocation. Also, some of the VC++ intrinsics evaluate to function calls that don't inline! _mm_cvtps_pi16, for example. And I think intrinsics are harder to read than asm (asm is no picnic either, don't get me wrong).

Quote:
 Visual C++ inline asm worthless unless you code a loop routine that hides the overheads as something marginal.

Yeah, that's definitely what I (almost) always do (see below). The determinant thing was just for kicks... as I typed in those shufps I'd always been suspicious of the cross product, suspecting that scalar might be faster, and decided to try. If you notice, the SIMD version only uses 5 registers while the scalar version uses all 8, so you can interleave other ops (and, like below, after I compute (R x A) and need (R x B), R is already shuffled into position for a cross product, so it essentially gets 2 more registers 'for free').

Quote:
 4) Last point, it might be possible to schedule the code a bit more....And there is a long chain to make the final dot product. That's why FPU or ss SSE might beat ps here.

Right, I noticed that movaps xmm2 should come after movaps xmm1... that little blunder probably propagates down the entire chain and costs a couple of cycles by the end. If you look at Sony's VUs, you can swizzle vec4 components within the instruction AND they have a proper MADD. I wish they'd done that with SSE.

Anyway, just so you know I'm not knocking SSE too hard, here is a good example of better parallelism. It takes a ray (r0, r1), a list of vertices (t*), and indices of triangles in t (idxs*), and tests the (bi-directional, infinite) ray for intersection against each triangle. Indexes of intersected triangles are returned in results.tri_idx, with the function returning the # of intersections. ray_x_triplane then takes the list of results and calculates result.t for the ray (r0 + t*(r1-r0)).

I use it like:

Ray_tri_result results[result_size];

// count is the number of intersections in the current batch.
int count = ray_x_tri_indexed_SSE(r0, r1, &m_verts[0], &m_idxs[0], results, result_size);

// fill in the results.t values for intersected triangles.
ray_x_triplane_indexed_SSE(r0, r1, &m_verts[0], &m_idxs[start_idx], results, count);

struct Ray_tri_result
{
	int	tri_idx;	// triangle formed by idxs[tri_idx*3], idxs[tri_idx*3 + 1], and idxs[tri_idx*3 + 2] intersects.
	float	t;		// value of t along ray of intersection.
};

// use polyhedral volumes to determine whether a ray will intersect a triangle.  Only determines
// whether the ray will intersect the interior of the triangle for some value t in r0 + (r1 - r0)t,
// not necessarily 0 <= t <= 1.0f.  If this test succeeds, call ray_x_triplane to determine the
// exact value of t at intersection.
int ray_x_tri_indexed_SSE(const vec4& r0, const vec4& r1, const vec4* t,
                          const unsigned int* idxs, Ray_tri_result* results, int num_tris)
{
	int result_idx = 0;
	int tri_count = 0;

	// this function only works for groups because it preloads the next triangle.
	// If the count is 0 or 1, we can't use this function.
	if(num_tris == 0)
		return 0;
	else if(num_tris == 1)
		return ray_x_tri_indexed(r0, r1, t, idxs, results);

	num_tris--;

	__asm {
		mov		edi,	idxs
		mov		eax,	r0
		movaps	xmm7,	[eax]			// xmm7 : r0
		mov		edx,	[edi + 4]		// here edx gets t1
		shl		edx,	4			// each vert is 16 bytes
		add		edx,	t
		movaps	xmm0,	[edx]			// xmm0 : t1
		mov		ecx,	r1
		movaps	xmm1,	[ecx]			// xmm1 : r1
		subps	xmm0,	xmm7			// subtract out ray origin from each vertex
		mov		edx,	[edi]
		shl		edx,	4			// each vert is 16 bytes
		add		edx,	t
		movaps	xmm2,	[edx]			// xmm2 : t0
		subps	xmm1,	xmm7
		subps	xmm2,	xmm7
		mov		edx,	[edi + 8]
		shl		edx,	4			// each vert is 16 bytes
		add		edx,	t
		movaps	xmm6,	[edx]			// put in xmm6 because we don't need it til later
		subps	xmm6,	xmm7			// xmm6 = C; xmm7 free after this.  Stall here, but only once.

start_loop:
		// R = r1 - r0  :  xmm1
		// A = t0 - r0  :  xmm2
		// B = t1 - r0  :  xmm0
		// C = t2 - r0  :  xmm6
		// compute:
		// det(R A B) as  B . (R x A)
		// det(R B C) as  C . (R x B)
		// det(R C A) as  C . -(R x A)
		// if the signs of the 3 determinants are the same, then there is an intersection.
		add		edi,	0Ch			// advance 3 indices * 4 bytes/index
		movaps	xmm3,	xmm1			// start R x A
		movaps	xmm4,	xmm2
		shufps	xmm3,	xmm3,	11001001b	// xmm3 :  ? Rx Rz Ry
		shufps	xmm4,	xmm4,	11010010b	// xmm4 :  ? Ay Ax Az
		shufps	xmm1,	xmm1,	11010010b	// xmm1 :  ? Ry Rx Rz
		shufps	xmm2,	xmm2,	11001001b	// xmm2 :  ? Ax Az Ay
		mulps	xmm4,	xmm3			// xmm4 =  ? RxAy RzAx RyAz
		mulps	xmm2,	xmm1			// xmm2 =  ? RyAx RxAz RzAy
		xorps	xmm7,	xmm7			// clear out xmm7 to store -(R X A) later, i.e. (A X R)
		subps	xmm4,	xmm2			// nasty dependency here. xmm4 = (R X A)
		movaps	xmm2,	xmm0			// done with xmm2, so get ready for (R X B) later.
		subps	xmm7,	xmm4			// xmm7 now holds (A X R)
		mov		ecx,	r1
		mov		eax,	r0
		mulps	xmm4,	xmm0			// now compute B * (R X A);
							// interlaced, we compute C * (A X R)
							// and begin computing C * (R X B),
							// R is already primed for cross in xmm3 and xmm1
		shufps	xmm2,	xmm2,	11010010b	// xmm2 :  ? By Bx Bz
		movhlps	xmm5,	xmm4			// dotting... B * (R X A)
		mulps	xmm7,	xmm6			// xmm7 : C * (A X R)
		shufps	xmm0,	xmm0,	11001001b	// xmm0 :  ? Bx Bz By
		mulps	xmm3,	xmm2			// xmm3 :  ? RxBy RzBx RyBz
		addps	xmm5,	xmm4			// dotting... B * (R X A)
		mulps	xmm1,	xmm0			// xmm1 :  ? RyBx RxBz RzBy	// xmm0 now free.
		movhlps	xmm0,	xmm7			// dotting... C * (A X R)
		shufps	xmm4,	xmm4,	11000001b	// dotting... B * (R X A)
		subps	xmm3,	xmm1			// xmm3 : (R X B)		// xmm1 free.
		addps	xmm0,	xmm7			// dotting... C * (A X R)
		movaps	xmm1,	[ecx]			// xmm1 : r1
		mulps	xmm3,	xmm6			// xmm3 : C * (R X B)		// xmm6 free
		movaps	xmm6,	xmm2			// free up xmm2
		shufps	xmm7,	xmm7,	11000001b	// dotting... C * (A X R)
		movhlps	xmm6,	xmm3			// dotting... C * (R X B)
		addss	xmm4,	xmm5			// xmm4 = dot(B, (R X A))	// xmm5 free
		movaps	xmm5,	xmm7			// free up xmm7 to start loading in the next tri
		mov		edx,	[edi]
		shl		edx,	4			// each vert is 16 bytes
		add		edx,	t
		movaps	xmm2,	[edx]			// preparing next triangle...
		addss	xmm5,	xmm0			// xmm5 = dot(C, (A X R))	// xmm0 free
		addss	xmm6,	xmm3
		movaps	xmm7,	[eax]			// xmm7 : r0
		shufps	xmm3,	xmm3,	11000001b	// shuffle the y component over to the scalar add portion.
		mov		edx,	[edi + 4]
		shl		edx,	4			// each vert is 16 bytes
		add		edx,	t
		movaps	xmm0,	[edx]
		addss	xmm6,	xmm3			// all 3 dots computed.
		subps	xmm1,	xmm7
		unpcklps xmm5,	xmm6			// move all dot results into xmm5 so that we
		movlhps	xmm5,	xmm4			// can movmskps to recover the sign bits
		subps	xmm0,	xmm7
		mov		edx,	[edi + 8]
		shl		edx,	4			// each vert is 16 bytes
		add		edx,	t
		movaps	xmm6,	[edx]
		movmskps eax,	xmm5			// movmskps grabs the msb (sign) from each lane.
		subps	xmm2,	xmm7
		subps	xmm6,	xmm7			// xmm6 = C; xmm7 free after this.
		// here we add the current triangle being tested to the result list as if it were an
		// intersection.  Then we conditionally increment the result count based on whether
		// there was an intersection.  This is all to avoid a branch.
		mov		edx,	[results]
		add		edx,	result_idx
		mov		ecx,	tri_count
		mov		[edx],	ecx
		mov		edx,	result_idx
		mov		ecx,	result_idx
		add		ecx,	8
		and		eax,	111b			// w-component is meaningless, make sure it doesn't interfere.
		cmp		eax,	0
		cmove	edx,	ecx
		cmp		eax,	111b			// check for back-facing triangles also
		cmove	edx,	ecx
		mov		result_idx, edx
		add		tri_count, 1
		sub		num_tris, 1
		jnz		start_loop

		// we've already loaded the last tri.  Just need to do a last iteration of the loop.
		movaps	xmm3,	xmm1
		movaps	xmm4,	xmm2
		shufps	xmm3,	xmm3,	11001001b	// xmm3 :  ? Rx Rz Ry
		shufps	xmm4,	xmm4,	11010010b	// xmm4 :  ? Ay Ax Az
		shufps	xmm1,	xmm1,	11010010b	// xmm1 :  ? Ry Rx Rz
		shufps	xmm2,	xmm2,	11001001b	// xmm2 :  ? Ax Az Ay
		mulps	xmm4,	xmm3			// xmm4 =  ? RxAy RzAx RyAz
		mulps	xmm2,	xmm1			// xmm2 =  ? RyAx RxAz RzAy
		xorps	xmm7,	xmm7			// clear out xmm7 to store -(R X A) later, i.e. (A X R)
		subps	xmm4,	xmm2			// nasty dependency here. xmm4 = (R X A)
		movaps	xmm2,	xmm0			// done with xmm2, so get ready for (R X B) later.
		subps	xmm7,	xmm4			// xmm7 now holds (A X R)
		mulps	xmm4,	xmm0			// now compute B * (R X A);
							// interlaced, we compute C * (A X R)
							// and begin computing C * (R X B),
							// R is already primed for cross in xmm3 and xmm1
		shufps	xmm2,	xmm2,	11010010b	// xmm2 :  ? By Bx Bz
		movhlps	xmm5,	xmm4			// dotting... B * (R X A)
		mulps	xmm7,	xmm6			// xmm7 : C * (A X R)
		shufps	xmm0,	xmm0,	11001001b	// xmm0 :  ? Bx Bz By
		mulps	xmm3,	xmm2			// xmm3 :  ? RxBy RzBx RyBz
		addps	xmm5,	xmm4			// dotting... B * (R X A)
		mulps	xmm1,	xmm0			// xmm1 :  ? RyBx RxBz RzBy	// xmm0 now free.
		movhlps	xmm0,	xmm7			// dotting... C * (A X R)
		shufps	xmm4,	xmm4,	11000001b	// dotting... B * (R X A)
		subps	xmm3,	xmm1			// xmm3 : (R X B)
		addss	xmm0,	xmm7			// dotting... C * (A X R)
		mulps	xmm3,	xmm6			// xmm3 : C * (R X B)
		shufps	xmm7,	xmm7,	11000001b	// dotting... C * (A X R)
		addss	xmm4,	xmm5			// xmm4 = dot(B, (R X A))
		movhlps	xmm2,	xmm3			// dotting... C * (R X B)
		addss	xmm7,	xmm0			// xmm7 = dot(C, (A X R))
		addss	xmm2,	xmm3			// xmm0,1,5,6 all unused now.
		shufps	xmm3,	xmm3,	11000001b
		addss	xmm2,	xmm3			// all 3 dots computed.
		unpcklps xmm7,	xmm2			// trickiness here:
		movlhps	xmm7,	xmm4			// arrange the dots in a register and do a
		movmskps eax,	xmm7			// movmskps, which grabs the msb (sign) bit from each lane.
		mov		edx,	results
		add		edx,	result_idx
		mov		ecx,	tri_count
		mov		[edx],	ecx
		mov		edx,	result_idx
		mov		ecx,	result_idx
		add		ecx,	8
		and		eax,	111b			// w-component is meaningless, make sure it doesn't interfere.
		cmp		eax,	0
		cmove	edx,	ecx
		cmp		eax,	111b			// check for back-facing triangles also
		cmove	edx,	ecx
		sar		edx,	3			// divide out the sizeof(Ray_tri_result)
		mov		result_idx, edx
	}
	return result_idx;
}
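For checking the asm against, here is a minimal scalar sketch of the same volume-sign test (helper names sub3/cross3/dot3 are illustrative; vec4 member access assumed as before, w lanes ignored):

static inline vec4 sub3(const vec4& a, const vec4& b)
{
	vec4 r;
	r.x = a.x - b.x;  r.y = a.y - b.y;  r.z = a.z - b.z;
	return r;
}
static inline vec4 cross3(const vec4& a, const vec4& b)
{
	vec4 r;
	r.x = a.y * b.z - a.z * b.y;
	r.y = a.z * b.x - a.x * b.z;
	r.z = a.x * b.y - a.y * b.x;
	return r;
}
static inline float dot3(const vec4& a, const vec4& b)
{
	return a.x * b.x + a.y * b.y + a.z * b.z;
}

// same predicate as the loop: the three scalar triple products must agree in sign.
static bool ray_x_tri_reference(const vec4& r0, const vec4& r1,
                                const vec4& t0, const vec4& t1, const vec4& t2)
{
	vec4 R = sub3(r1, r0), A = sub3(t0, r0), B = sub3(t1, r0), C = sub3(t2, r0);
	float d0 = dot3(B, cross3(R, A));	// det(R A B)
	float d1 = dot3(C, cross3(R, B));	// det(R B C)
	float d2 = dot3(C, cross3(A, R));	// det(R C A)
	return (d0 > 0.0f && d1 > 0.0f && d2 > 0.0f) ||
	       (d0 < 0.0f && d1 < 0.0f && d2 < 0.0f);
}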

and to calculate the value of t for each intersection:
// ray_x_triplane_indexed : results holds an index into idxs, which
// represents triangles via indexes of vertices in t.
void ray_x_triplane_indexed_SSE(const vec4& r0,
								const vec4& r1,
								const vec4* t,
								const unsigned int* idxs,
								Ray_tri_result* results,
								int count)
{
	// this function only works for groups because it preloads the next triangle.
	// If the count is 0 or 1, we can't use this function.
	if(count == 0)
		return;
	else if(count == 1)
	{
		ray_x_triplane_indexed(r0, r1, t, idxs, results);
		return;
	}

	// calculate t along r0r1 which intersects the plane containing triangle t[0], t[1], t[2].
	// mathematically, we calculate ((r0 - v0) . (V1 x V2)) / ((r0 - r1) . (V1 x V2))
	__asm {
		sub		count,	1			// since we're preloading verts, need to test one ahead
		mov		edi,	[results]
		mov		esi,	0Ch
		mov		eax,	[edi]Ray_tri_result.tri_idx
		mul		esi
		mov		edx,	r1
		add		eax,	idxs
		mov		esi,	r0
		mov		ecx,	[eax]
		shl		ecx,	4			// each vert is 16 bytes
		add		ecx,	t
		movaps	xmm7,	[ecx]			// xmm7 : v0, need to subtract out v0 from each component.
		movaps	xmm0,	[esi]			// xmm0 : r0
		mov		ecx,	[eax + 4]
		shl		ecx,	4			// each vert is 16 bytes
		add		ecx,	t
		movaps	xmm1,	[ecx]			// xmm1 : V1
		mov		ecx,	[eax + 8]
		shl		ecx,	4			// each vert is 16 bytes
		add		ecx,	t
		movaps	xmm2,	[ecx]			// xmm2 : V2 ; first compute r0 . (V1 x V2)
		subps	xmm1,	xmm7
		subps	xmm2,	xmm7
		movaps	xmm3,	xmm1
		movaps	xmm6,	[edx]			// xmm6 : r1 ; but meanwhile start (r0 - r1)
		shufps	xmm3,	xmm3,	11001001b	// xmm3 :  ? Bx Bz By
		movaps	xmm4,	xmm2
		shufps	xmm4,	xmm4,	11010010b	// xmm4 :  ? Cy Cx Cz
		shufps	xmm1,	xmm1,	11010010b	// xmm1 :  ? By Bx Bz
		shufps	xmm2,	xmm2,	11001001b	// xmm2 :  ? Cx Cz Cy

start_loop:
		add		edi,	8			// move to next Ray_tri_result (size is 8 bytes)
		movaps	xmm5,	xmm0			// copy r0 to start processing the denominator.
		mulps	xmm4,	xmm3			// xmm4 =  ? BxCy BzCx ByCz
		subps	xmm0,	xmm7			// done with xmm7.
		subps	xmm5,	xmm6			// denom : (r0 - r1)
		mulps	xmm2,	xmm1			// xmm2 =  ? ByCx BxCz BzCy
		mov		esi,	0Ch
		mov		eax,	[edi]Ray_tri_result.tri_idx
		mul		esi
		subps	xmm4,	xmm2			// now have (V1 x V2)
		mov		edx,	r1
		mov		esi,	r0
		add		eax,	idxs
		mulps	xmm5,	xmm4			// starting ((r0 - r1) . n)
		mulps	xmm4,	xmm0			// starting ((r0 - v0) . n)
		mov		ecx,	[eax]
		shl		ecx,	4			// each vert is 16 bytes
		add		ecx,	t
		movaps	xmm7,	[ecx]			// xmm7 : v0, need to subtract out v0 from each component.
		movaps	xmm0,	[esi]			// xmm0 : r0
		movhlps	xmm6,	xmm5
		movhlps	xmm3,	xmm4
		mov		ecx,	[eax + 4]
		shl		ecx,	4			// each vert is 16 bytes
		add		ecx,	t
		movaps	xmm1,	[ecx]			// xmm1 : V1
		mov		ecx,	[eax + 8]
		shl		ecx,	4			// each vert is 16 bytes
		add		ecx,	t
		movaps	xmm2,	[ecx]			// xmm2 : V2 ; first compute r0 . (V1 x V2)
		addss	xmm6,	xmm5			// still dotting...
		addss	xmm3,	xmm4			// still dotting...
		shufps	xmm5,	xmm5,	11000001b
		shufps	xmm4,	xmm4,	11000001b
		addss	xmm5,	xmm6
		subps	xmm1,	xmm7
		movss	xmm6,	xmm4
		subps	xmm2,	xmm7
		addss	xmm6,	xmm3
		movaps	xmm3,	xmm1
		movaps	xmm4,	xmm2
		divss	xmm6,	xmm5
		shufps	xmm1,	xmm1,	11010010b	// xmm1 :  ? By Bx Bz
		shufps	xmm2,	xmm2,	11001001b	// xmm2 :  ? Cx Cz Cy
		shufps	xmm3,	xmm3,	11001001b	// xmm3 :  ? Bx Bz By
		shufps	xmm4,	xmm4,	11010010b	// xmm4 :  ? Cy Cx Cz
		movss	[edi-8]Ray_tri_result.t,	xmm6
		movaps	xmm6,	[edx]			// xmm6 : r1 ; but meanwhile start (r0 - r1)
		sub		count,  1
		jnz		start_loop

		// now process the last triangle, which is already loaded.
		movaps	xmm5,	xmm0			// copy r0 to start processing the denominator.
		mulps	xmm4,	xmm3			// xmm4 =  ? BxCy BzCx ByCz
		subps	xmm0,	xmm7			// done with xmm7.
		subps	xmm5,	xmm6			// denom : (r0 - r1)
		mulps	xmm2,	xmm1			// xmm2 =  ? ByCx BxCz BzCy
		subps	xmm4,	xmm2			// now have (V1 x V2)
		mulps	xmm5,	xmm4			// starting ((r0 - r1) . n)
		mulps	xmm4,	xmm0			// starting ((r0 - v0) . n)
		movhlps	xmm6,	xmm5
		movhlps	xmm3,	xmm4
		addss	xmm6,	xmm5
		addss	xmm3,	xmm4
		shufps	xmm5,	xmm5,	11000001b
		shufps	xmm4,	xmm4,	11000001b
		addss	xmm5,	xmm6
		addss	xmm4,	xmm3
		divss	xmm4,	xmm5
		movss	[edi]Ray_tri_result.t,	xmm4
	}
}
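And a scalar reference for the t computation, reusing the helpers from the sketch above: with n = (v1 - v0) x (v2 - v0), the plane equation gives t = ((r0 - v0) . n) / ((r0 - r1) . n), which is exactly what the numerator and denominator registers hold.

static float ray_x_triplane_reference(const vec4& r0, const vec4& r1,
                                      const vec4& v0, const vec4& v1, const vec4& v2)
{
	vec4  n     = cross3(sub3(v1, v0), sub3(v2, v0));	// triangle plane normal
	float num   = dot3(sub3(r0, v0), n);			// (r0 - v0) . n
	float denom = dot3(sub3(r0, r1), n);			// (r0 - r1) . n
	return num / denom;					// caller guards against denom == 0
}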

Damn, gamedev doesn't even warn you when you've changed your passwd and it gets filled in wrong!

Anyway, cut'n'paste error on
ray_x_triplane_indexed_SSE(r0, r1, &m_verts[0], &m_idxs[start_idx], results, count);

should be:
ray_x_triplane_indexed_SSE(r0, r1, &m_verts[0], &m_idxs[0], results, count);

Quote:
Hmm, maybe. Is there a better way?

Yes: normally you could receive the registers as parameters via the __fastcall calling convention. Alas, the big problem is compiler bugs when you mix __fastcall and inline asm: as soon as VCC decides to inline the function, it loses track of the register roles and you get anything in the registers (which here should contain the pointers to your vectors). That's why, for small functions, intrinsics are the only decent solution... with VCC.

Quote:
I think that in the end, the first 2 movs are dispatched together, and as soon as eax is calculated, the movaps starts.

True, but an AGI stall is something else: a stall in the address-generation and memory-access stage of the execution pipeline. Complex superscalar processors (the P3, for instance) need an address to be known well ahead of the read or write that uses it (and also ahead of a lea). There are many things coming into play: cache prediction, alignment and boundary tests... I don't know the reasons in detail, but I know the result from a coder's pov: a stall. Your pointers are filled just before they are dereferenced (mov ..., [eax]), which is bad. So be conscious of this when you process indexed arrays, for instance: try to unroll and have the indices ready a bit more in advance, and you can combine that with prefetching strategies. That's a bit clutching at straws, but that's asm...
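For instance, a sketch of the idea (illustrative names, not code from the thread; it assumes idxs has one entry of padding past the end and n >= 1 -- otherwise peel the last iteration the way the ray-tri routine above does):

void sum_indexed(const vec4* verts, const unsigned int* idxs, int n, vec4* acc)
{
	__asm {
		mov		esi,	idxs
		mov		eax,	[esi]			// index 0 resolved ahead of the loop
		shl		eax,	4
		add		eax,	verts			// eax : &verts[idxs[0]]
		mov		ecx,	n
		xorps	xmm0,	xmm0
loop_top:
		mov		edx,	[esi + 4]		// NEXT index, loaded long before its use
		shl		edx,	4
		add		edx,	verts			// edx "ripens" while the addps executes
		addps	xmm0,	[eax]			// current vertex: its address has been ready a full iteration
		mov		eax,	edx			// promote next address to current
		add		esi,	4
		sub		ecx,	1
		jnz		loop_top
		mov		edx,	acc
		movaps	[edx],	xmm0
	}
}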

Quote:
Sometimes inlined, sometimes not. I never really know unless I look.

I mentioned this because, compared to the order of magnitude of your function timings (43 cycles), the call overhead might be very significant, probably around 5-10 cycles. And this could make the comparison you made (_ps vs _ss) quite noisy, because of the overall scheduling.

Quote:
Many times I return values via non-const reference... I'll have to check if that affects inlining.

No, it should not; I see no reason it would, and I clearly remember such functions being inlined. I'd even add that some wise compilers may notice that a local variable (normally allocated on the stack) used as the output through a reference need not be put on the stack at all, but can be kept in a register. Visual C++ is capable of this in some cases. Try writing an add_4f(&r, &s, &t) function based on the intrinsic, for instance: r might never be saved to the stack. But none of this is possible if you use inline asm (e.g. movss result, xmm4), because VCC does not touch asm code. Another reason to use intrinsics.

Quote:
Have you had good luck with intrinsics?

With VCC and 3DNow, yes. With SSE, so far yes, though I have not tested as many things as with 3DNow. Normally VCC does not change the instruction order too much; still, it does some. In practice you don't lose too many cycles to it because of the reordering capacities of the CPUs. I'd say at worst the 'negative contribution' of VCC is 10-20% on this issue.

Quote:
Also some of the VC++ intrinsics evaluate to function calls that don't inline! _mm_cvtps_pi16 for example.

I don't have my docs on this PC, and I don't remember that precise instruction well. But maybe it's because the instruction is emulated for the minimal target processor you selected in the compiler settings.

Quote:
If you notice, the SIMD version only uses 5 registers while the scalar version uses all 8.

Sure, this might count on a bigger scale. I'll give your routine a try, rewriting it. But from the benchmarks I remember, your final count of 43 cycles seems very high to me. I remember quaternion muls of 16-27 cycles in various SIMD (3DNow/SSE/Altivec) / compiler (VCC/GCC) combinations, which is something comparable, probably even more complex. That's why I submitted some tracks for potential improvement.

Quote:
If you look at sony's VUs, you can swizzle vec4 components within the instruction AND they have proper MADD. I wish they'd done that with SSE.

Yes, while creating VSIMD (Virtual SIMD: a multiplatform library) (*) I noticed that 3DNow could have been very efficient if a perp_2f instruction had been wired in: (x,y) -> (-y,x). With it you can define a cross product in a few ops. For SSE I think they should have wired in an acc_4f that fills the vector with (x+y+z+w). But they preferred to sell their SoA strategy, which I find a pity. Take collision detection or any geometric algorithm: it most often requires handling primitives (vertices, triangles, etc.) one by one, so the cross and dot products should have been considered far more.

For your Ray/Tri routine I don't have time to take a detailed look at it atm. But I'll come back to it later, if I see anything relevant to comment.

(*) To anyone who's seen posts about it and might wonder "WTF r u doing about it?": rest assured it's not a dead project. It's simply that my private life has been quite a hell recently, so it's on standby... till I can finish what I have in mind.

Quote:
Original post by Charles B
I know the result from a coder's pov: a stall. Your pointers are filled just before they are dereferenced (mov ..., [eax])... So be conscious of this when you process indexed arrays, for instance.

Okay, right. I just meant that you only pay for one stall (eax), and you get the others for free... and I didn't know any way to eliminate that one. As for indexed arrays, you're going to really hate my ray-tri code then. The reality is that it ends up only being about 4 cycles (out of 64) slower than the non-indexed triangle code. Calculating the exact memory address to load and then loading each next vertex gets completely rescheduled by the processor, so it just ends up running in stall cycles while the current tri is tested. The three instructions are all dispatched in one cycle, and I didn't know a better way to do it.

Quote:
(regarding intrinsics...) Normally VCC does not change the instruction order too much; still, it does some. In practice you don't lose too many cycles to it because of the reordering capacities of the CPUs.

My worry is not instruction re-ordering but register economisation. I write many routines so that all SSE registers are saturated all the time... as soon as a register frees up, I want to use it either to start processing the next op or to start preloading the data for the next loop iteration. I'm not sure I trust VC to do that best... although you're right that in the past my project settings may not have been set properly to let the compiler do this.

Quote:
Sure, this might count on a bigger scale. I'll give your routine a try, rewriting it... For your Ray/Tri routine I don't have time to take a detailed look atm. But I'll come back to it later, if I see anything relevant to comment.

If you put determinants in a loop, you could optimize them very well... the movaps and swizzles could be done for free during the huge dot dependency chain, especially because the dot occurs in only 2 registers.

The ray-tri: I didn't mean to include the whole damn thing... I just meant to say "I understand that using fewer registers is very useful in real circumstances. Here's how I know! (within 5 instructions it's interleaving a cross product, 2 dot products, and loading the next triangle's vertices)"... but I thought a snippet would be meaningless without context, so I just pasted the whole thing :)

Quote:
(*) To anyone who's seen posts about it and might wonder "WTF r u doing about it?": rest assured it's not a dead project. It's simply that my private life has been quite a hell recently, so it's on standby... till I can finish what I have in mind.

Been wondering. I've been thinking about writing an SSE math syntax that could be run through a separate (custom) pre-processor and turned into SSE asm. Not focusing on optimal scheduling but on register saturation (the processor does such phenomenal work scheduling/re-ordering instructions that the best bet seems to be simply preventing redundant loads/stores)... but it's a huge undertaking, and it seems others are further along.

Quote:
Original post by ajas95
As for indexed arrays, you're going to really hate my ray-tri code then. The reality is that it ends up only being about 4 cycles (out of 64) slower than the non-indexed triangle code.

Indexed arrays spare memory and cache from an algorithmic pov, and they help in other ways too. So they're nice and practical, no question about it.

But since you need an unroll x2 to hide the dot chain (cf. later), it will be easy to anticipate the address (index) loading. Even if the CPU reordering might help anyway, this can only be better, or equal at worst.

Quote:
 I'm not sure that I trust VC to do it best... although you're right that in the past maybe I've had my project settings not set properly to let the compiler do this.

There are many, many small details to control for each compiler. It's crazy. I have spent a lot of time on apparently pointless stuff which, in the end, changes the benchmarks very substantially, sometimes by >> 100% speed-up factors. But it's a very clumsy job, highly compiler dependent, even version dependent, and it gives an artificial complexity (wizardry? ugliness?) to my HAL. Still, this ugliness is mostly isolated in the few files of the compile-time "drivers" below my HAL. I'll come back to the issue of register pressure later (cf. below).

Quote:
 the movaps and swizzles could be done for free during the huge dot dependency chain, especially because the dot occurs in only 2 registers.

That's exactly what you can do with an unroll factor of 2, interleaving two "threads" (in the generic sense). Each "thread" then needs roughly half the registers (register economisation). I see that you know what to do in broad lines.
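As a sketch of that unroll-by-2 shape in intrinsics (reusing det3_ps from the sketch earlier in the thread; the aligned layout and even n are illustrative assumptions):

// rows holds n consecutive row-triples (3 vec4s each, 16-byte aligned);
// dets receives one determinant per triple.
void det3_batch(const float* rows, float* dets, int n)
{
	for (int i = 0; i + 1 < n; i += 2) {
		__m128 a0 = _mm_load_ps(rows + 12*i +  0);
		__m128 b0 = _mm_load_ps(rows + 12*i +  4);
		__m128 c0 = _mm_load_ps(rows + 12*i +  8);
		__m128 a1 = _mm_load_ps(rows + 12*i + 12);
		__m128 b1 = _mm_load_ps(rows + 12*i + 16);
		__m128 c1 = _mm_load_ps(rows + 12*i + 20);
		// two independent dot chains: the shuffles of one overlap
		// the horizontal-add stalls of the other.
		_mm_store_ss(dets + i,     det3_ps(a0, b0, c0));
		_mm_store_ss(dets + i + 1, det3_ps(a1, b1, c1));
	}
}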

Quote:
Been wondering. I've been thinking about writing an SSE math syntax that could be run through a separate (custom) pre-processor and turned into SSE asm.

I have heard of C--, but it does not seem to go entirely in that direction. There are some other things too. Yes, I also imagined a preprocessor could be an elegant solution. But there is so much work to do; I am too lazy and isolated to do it alone, plus I lack experience in writing compiler technology. I prefer to rely on the prediction that compilers will improve SIMD code generation in the future, which seems to be happening, if slowly (GCC, for instance). So my job is basically writing an abstraction layer that standardizes the common or equivalent features of the various instruction sets (3DNow, SSE1, SSE2, Altivec), relying mostly on intrinsics. This already requires a good deal of analysis, synthesis and creativity. And this experience might serve to build a more detailed preprocessing or compiling tool... if necessary.

Quote:
Not focusing on optimal scheduling but register saturation (the processor does such phenomenal work scheduling/re-ordering instructions, it seems the best bet is just to try to prevent redundant loads/stores)... but then it's a huge undertaking and it seems others are further along.

All in all, you can be confident that the compiler will not blow up a register allocation you suggest, as long as you don't constrain it too tightly. The most common dead end: if things are so tight that you use (reg, mem) variations, VCC does not respect your decision and splits them into a load plus an operate, which wastes one register and destroys your nice scheduling puzzle. GCC respects it, since I replaced the default intrinsics with "intelligent" inline asm.

As a consequence, try to find decent implementations where every intermediate variable is explicitly loaded, and leave some room for the compiler's errors of strategy. I mean: if you keep a margin of one or two registers when the compiler comes into play, you can be sure it won't break the puzzle. For instance, if you explicitly use 7 "registers" and load explicitly, no problem. If you use 6 "registers" and (reg, mem) ops, it should also work.
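A minimal sketch of that advice (illustrative function, not from the thread): every operand is loaded into its own named __m128, nothing is folded into a (reg, mem) operand, and only four values are live at once, leaving plenty of register margin.

#include <xmmintrin.h>

static inline void add_scale_4f(float* r, const float* a, const float* b, const float* s)
{
	__m128 va = _mm_load_ps(a);	// explicit load...
	__m128 vb = _mm_load_ps(b);	// ...rather than _mm_add_ps(va, *(const __m128*)b)
	__m128 vs = _mm_load_ps(s);
	_mm_store_ps(r, _mm_mul_ps(_mm_add_ps(va, vb), vs));
}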
