Beware the SSE beast...

I've been using a lot of SSE lately to speed up the math-heavy parts of my demos, and I've been blown away by the speed improvements I've been getting. BUT, I do have to say that for scalar ops it really, really sucks. Take, for example, determinants. I wrote a little SSE function to calculate the determinant of a 3x3 matrix formed by vectors v0, v1, and v2 (which is just v0 . (v1 x v2)):
float det(const vec4& v0, const vec4& v1, const vec4& v2)
{
	float result;

	__asm {
		mov	eax,	v0
		mov	ecx,	v1
		mov	edx,	v2

		movaps	xmm3,	[ecx]
		movaps	xmm2,	[edx]
		movaps	xmm4,	[edx]
		movaps	xmm1,	[ecx]
		movaps	xmm0,	[eax]			// xmm0 :  ? Az Ay Ax
		shufps	xmm3,	xmm3,	11001001b	// xmm3 :  ? Bx Bz By
		shufps	xmm4,	xmm4,	11010010b	// xmm4 :  ? Cy Cx Cz
		shufps	xmm1,	xmm1,	11010010b	// xmm1 :  ? By Bx Bz
		shufps	xmm2,	xmm2,	11001001b	// xmm2 :  ? Cx Cz Cy
		mulps	xmm4,	xmm3			// xmm4 =  ? BxCy BzCx ByCz
		mulps	xmm2,	xmm1			// xmm2 =  ? ByCx BxCz BzCy
		subps	xmm4,	xmm2			// xmm4 = v1 x v2
		mulps	xmm4,	xmm0			// now dot v0 . xmm4
		movhlps	xmm5,	xmm4
		addss	xmm5,	xmm4			// standard dot product.  The dependency chain here is very bad.
		shufps	xmm4,	xmm4,	11000001b
		addss	xmm4,	xmm5
		movss	result,	xmm4
	}

	return result;
}



Not bad. 50 cycles. But then, just for kicks, I decided to go back and write a scalar version to see what would happen if I got rid of all those stupid shufps and movhlps:
float det_scalar(const vec4& v0, const vec4& v1, const vec4& v2)
{
	float result;

	__asm {
		mov	eax,	v0
		mov	ecx,	v1
		mov	edx,	v2

		movss	xmm0,	[ecx]vec4.y
		movss	xmm1,	[edx]vec4.z
		movss	xmm2,	[ecx]vec4.x
		movss	xmm4,	xmm0
		movss	xmm5,	xmm2
		movss	xmm3,	[edx]vec4.x
		movss	xmm7,	[ecx]vec4.z
		movss	xmm6,	[edx]vec4.y
		mulss	xmm0,	xmm1			// xmm0 : ByCz
		mulss	xmm2,	xmm1			// xmm2 : BxCz		: xmm1 free
		mulss	xmm4,	xmm3			// xmm4 : ByCx
		movss	xmm1,	[ecx]vec4.z		// xmm1 : Bz
		mulss	xmm5,	xmm6			// xmm5 : BxCy
		mulss	xmm7,	xmm6			// xmm7 : BzCy		: xmm6 free
		mulss	xmm1,	xmm3			// xmm1 : BzCx		: xmm3 free
		subss	xmm5,	xmm4			// xmm5 : BxCy - ByCx	: xmm4 free
		movss	xmm6,	[eax]vec4.x
		subss	xmm1,	xmm2			// xmm1 : BzCx - BxCz	 : xmm2 free
		subss	xmm0,	xmm7			// xmm0 : ByCz - BzCy	 : xmm7 free
		movss	xmm3,	[eax]vec4.z
		movss	xmm7,	[eax]vec4.y
		mulss	xmm0,	xmm6			// xmm0 : Ax * (ByCz - BzCy)
		mulss	xmm3,	xmm5			// xmm3 : Az * (BxCy - ByCx)
		mulss	xmm7,	xmm1			// xmm7 : Ay * (BzCx - BxCz)
		addss	xmm0,	xmm3
		addss	xmm0,	xmm7
		movss	result, xmm0
	}
	
	return result;
}




43 cycles! Almost a 15% improvement just by converting the vector ops to scalar. SIMD my ass. Does anyone have a fast FPU determinant? If scalar SSE is so much faster than SIMD SSE, maybe the FPU will be fastest of all?
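For reference, the plain C version I'd be timing against is just the expanded triple product (untested sketch, assuming vec4 exposes x, y, z members):

// Plain C/C++ 3x3 determinant, det = v0 . (v1 x v2).
// Sketch only; assumes vec4 has public x, y, z members.
inline float det_fpu(const vec4& v0, const vec4& v1, const vec4& v2)
{
	return v0.x * (v1.y * v2.z - v1.z * v2.y)
	     + v0.y * (v1.z * v2.x - v1.x * v2.z)
	     + v0.z * (v1.x * v2.y - v1.y * v2.x);
}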

1) This costs you an AGI stall:

mov eax, v0
mov ecx, v1
mov edx, v2


2) Also, your function will probably not be inlined, thus call/ret overhead. Especially if the params are stored on the stack.


A way to get rid of all of this with Visual: prefer the C Intel intrinsics to asm. Only GCC's inline asm is worth the trouble. Calling conventions and fixed register allocation make Visual C++ inline asm worthless unless you code a loop routine that hides the overheads as something marginal.
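For instance, the determinant above written with intrinsics would be roughly this (untested sketch; I assume vec4 is 16-byte aligned and xmmintrin.h is included). The compiler then handles inlining and register allocation for you:

#include <xmmintrin.h>

// Rough intrinsics sketch of det = v0 . (v1 x v2); assumes vec4 is 16-byte aligned.
inline float det_intrin(const vec4& v0, const vec4& v1, const vec4& v2)
{
	__m128 a = _mm_load_ps((const float*)&v0);
	__m128 b = _mm_load_ps((const float*)&v1);
	__m128 c = _mm_load_ps((const float*)&v2);

	// cross = v1 x v2, built from two shuffled products (lane 0 listed first)
	__m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));	// By Bz Bx ?
	__m128 c_zxy = _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 1, 0, 2));	// Cz Cx Cy ?
	__m128 b_zxy = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 1, 0, 2));	// Bz Bx By ?
	__m128 c_yzx = _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1));	// Cy Cz Cx ?
	__m128 cross = _mm_sub_ps(_mm_mul_ps(b_yzx, c_zxy), _mm_mul_ps(b_zxy, c_yzx));

	// dot(v0, cross): the same horizontal-add chain as the asm version
	__m128 prod = _mm_mul_ps(a, cross);
	__m128 sum  = _mm_add_ss(prod, _mm_movehl_ps(prod, prod));	// x + z
	sum = _mm_add_ss(sum, _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(1, 1, 1, 1)));	// + y

	float result;
	_mm_store_ss(&result, sum);
	return result;
}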

3) movss result, xmm4
return result
Also costs additional dependency chains plus traffic on the stack. My way is to use a scalar type (or C++ class) that is not a float; usually it's a duplicated float (s,s,s,s). That way there is no conversion cost, and if you want to scale a vector by this determinant next, it's immediate.
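Something like this, as a rough illustration (names invented):

#include <xmmintrin.h>

// Sketch of the "duplicated float" idea: keep the scalar splatted in all four
// lanes instead of storing it back to a float on the stack.
struct scalar4
{
	__m128 s;	// (v, v, v, v)
};

// a determinant could then end with a splat instead of a movss to memory:
//   scalar4 d;  d.s = _mm_shuffle_ps(sum, sum, _MM_SHUFFLE(0, 0, 0, 0));  return d;

// and scaling a vector by the result stays entirely in SSE registers:
inline __m128 scale(__m128 v, scalar4 k) { return _mm_mul_ps(v, k.s); }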

4) Last point, it might be possible to schedule the code a bit more.

All in all you'll see more clearly what SSE can actually do. Still, the main problem is that there are not enough ops to achieve full parallelism. Mostly two threads: the right and left parts of the cross product. And there is a long chain to make the final dot product. That's why FPU or ss SSE might beat ps here.

Guest Anonymous Poster
Quote:
Original post by Charles B
1) This costs you an AGI stall:

mov eax, v0
mov ecx, v1
mov edx, v2


Hmm, maybe. Is there a better way? I think that in the end the first 2 movs are dispatched together, and as soon as eax is calculated, the movaps starts. And while the other addresses are generated, v0 is being shufpsed, so it's all parallelized anyway. I think I've gotten too used to writing really bad integer code since it all gets hidden by the SSE code running in parallel.

Quote:

2) Also, your function will probably not be inlined, thus call/ret overhead. Especially if the params are stored on the stack.


Sometimes inlined, sometimes not. I never really know unless I look. Many times I return values via non-const reference... I'll have to check if that affects inlining.

Quote:

A way to get rid of all of this with Visual: prefer the C Intel intrinsics to asm.


Have you had good luck with intrinsics? I feel like I tried them once and was disappointed by the scheduling/register allocation. Also, some of the VC++ intrinsics evaluate to function calls that don't inline! _mm_cvtps_pi16 for example. And I think intrinsics are harder to read than asm (asm is no picnic either, don't get me wrong).

Quote:

Visual C++ inline asm worthless unless you code a loop routine that hides the overheads as something marginal.


Yeah, that's definitely what I (almost) always do (see below). The determinant thing was just for kicks... I'd always suspected, as I typed in those shufps, that a scalar cross product might be faster, so I decided to try it. If you notice, the SIMD version only uses 5 registers while the scalar version uses all 8, so you can interleave other ops (and, as below, after I compute (R x A) and need (R x B), R is already shuffled into position for a cross product, so it essentially gets 2 more registers 'for free').

Quote:

4) Last point, it might be possible to schedule the code a bit more.
...
And there is a long chain to make the final dot product. That's why FPU or ss SSE might beat ps here.


Right, I noticed that movaps xmm2 should be after movaps xmm1... that little blunder probably propagates down the entire chain and costs a couple of cycles by the end. If you look at Sony's VUs, you can swizzle vec4 components within the instruction AND they have proper MADD. I wish they'd done that with SSE.

Anyway, just so you know I'm not knocking SSE too hard, here is a good example of better parallelism. It takes a ray (r0, r1), a list of vertices (t*), and indices of triangles in t (idxs*), and tests the (bi-directional, infinite) ray for intersection with each triangle. Indices of intersected triangles are returned in results.tri_idx, with the function returning the # of intersections. Ray_x_triplane then takes the list of results and calculates result.t for the ray (r0 + t*(r1-r0)).

I use it like:

Ray_tri_result results[result_size];

// count is the number of intersections in the current batch.
int count = ray_x_tri_indexed_SSE(r0, r1, &m_verts[0], &m_idxs[0], results, result_size);

// fill in the results.t values for intersected triangles.
ray_x_triplane_indexed_SSE(r0, r1, &m_verts[0], &m_idxs[start_idx], results, count);




struct Ray_tri_result
{
	int   tri_idx;	// triangle formed by idxs[tri_idx*3], idxs[tri_idx*3 + 1], and idxs[tri_idx*3 + 2] intersects.
	float t;	// value of t along ray of intersection.
};

// use polyhedral volumes to determine whether a ray will intersect a triangle. Only determines
// whether ray will intersect interior of triangle for some value t in r0 + (r1 - r0)t,
// not necessarily 0 <= t <= 1.0f. If this test succeeds, call ray_x_triplane to determine the
// exact value of t at intersection.

int ray_x_tri_indexed_SSE(const vec4& r0, const vec4& r1, const vec4* t, const unsigned int* idxs, Ray_tri_result* results, int num_tris )
{
int result_idx = 0;
int tri_count = 0;

// this function only works for groups because it preloads the next triangle.
// If the count is 0 or 1, we can't use this function.

if(num_tris == 0)
return 0;

else if(num_tris == 1)
{
return ray_x_tri_indexed(r0, r1, t, idxs, results);
}

num_tris--;


__asm {
mov edi, idxs
mov eax, r0
movaps xmm7, [eax] // xmm7 : r0
mov edx, [edi + 4] // here edx gets t1
shl edx, 4 // each vert is 16 bytes
add edx, t
movaps xmm0, [edx] // xmm0 : t1
mov ecx, r1
movaps xmm1, [ecx] // xmm1 : r1
subps xmm0, xmm7 // subtract out ray origin from each vertex
mov edx, [edi]
shl edx, 4 // each vert is 16 bytes
add edx, t
movaps xmm2, [edx] // xmm2 : t0
subps xmm1, xmm7
subps xmm2, xmm7
mov edx, [edi + 8]
shl edx, 4 // each vert is 16 bytes
add edx, t
movaps xmm6, [edx] // put in xmm6 because we don't need it til later
subps xmm6, xmm7 // xmm6 = C; xmm7 free after this. Stall here, but only once.

start_loop:
// R = r1 - r0 : xmm1
// A = t0 - r0 : xmm2
// B = t1 - r0 : xmm0
// C = t2 - r0 : xmm6

// compute:
// det(R A B) as B . (R x A)
// det(R B C) as C . (R x B)
// det(R C A) as C .-(R x A)

// if signs of 3 determinants are the same, then there is intersection.

add edi, 0Ch // advance 3 verts * 4 bytes/index
movaps xmm3, xmm1 // start R x A
movaps xmm4, xmm2
shufps xmm3, xmm3, 11001001b // xmm3 : ? Rx Rz Ry
shufps xmm4, xmm4, 11010010b // xmm4 : ? Ay Ax Az
shufps xmm1, xmm1, 11010010b // xmm1 : ? Ry Rx Rz
shufps xmm2, xmm2, 11001001b // xmm2 : ? Ax Az Ay
mulps xmm4, xmm3 // xmm4 = ? RxAy RzAx RyAz
mulps xmm2, xmm1 // xmm2 = ? RyAx RxAz RzAy
xorps xmm7, xmm7 // clear out xmm7 to store -(R X A) later i.e. (A X R)
subps xmm4, xmm2 // nasty dependency here. xmm4 = (R X A)
movaps xmm2, xmm0 // done with xmm2 so get ready for (R X B) later.
subps xmm7, xmm4 // xmm7 now holds (A X R)
mov ecx, r1
mov eax, r0
mulps xmm4, xmm0 // now compute B * (R X A)
// interlaced, we compute C * (A X R)
// and begin computing C * (R X B),
// R is already primed for cross in xmm3 and xmm1
shufps xmm2, xmm2, 11010010b // xmm2 : ? By Bx Bz
movhlps xmm5, xmm4 // dotting... B * (R X A)
mulps xmm7, xmm6 // xmm7 : C * (A X R)
shufps xmm0, xmm0, 11001001b // xmm0 : ? Bx Bz By
mulps xmm3, xmm2 // xmm3 : ? RxBy RzBx RyBz
addps xmm5, xmm4 // dotting... B * (R X A)
mulps xmm1, xmm0 // xmm1 : ? RyBx RxBz RzBy // xmm0 now free.
movhlps xmm0, xmm7 // dotting... C * (A X R)
shufps xmm4, xmm4, 11000001b // dotting... B * (R X A)
subps xmm3, xmm1 // xmm3 : (R X B) // xmm1 free.
addps xmm0, xmm7 // dotting... C * (A X R)
movaps xmm1, [ecx] // xmm1 : r1
mulps xmm3, xmm6 // xmm3 : C * (R X B) // xmm6 free
movaps xmm6, xmm2 // free up xmm2
shufps xmm7, xmm7, 11000001b // dotting... C * (A X R)
movhlps xmm6, xmm3 // dotting... C * (R X B)
addss xmm4, xmm5 // xmm4 = dot(B, (R X A)) // xmm5 free
movaps xmm5, xmm7 // free up xmm7 to start loading in next tri
mov edx, [edi]
shl edx, 4 // each vert is 16 bytes
add edx, t
movaps xmm2, [edx] // preparing next triangle...
addss xmm5, xmm0 // xmm5 = dot(C, (A X R)) // xmm0 free
addss xmm6, xmm3
movaps xmm7, [eax] // xmm7 : r0
shufps xmm3, xmm3, 11000001b // shuffle y component over to the scalar add portion.
mov edx, [edi + 4]
shl edx, 4 // each vert is 16 bytes
add edx, t
movaps xmm0, [edx]
addss xmm6, xmm3 // all 3 dots computed.
subps xmm1, xmm7
unpcklps xmm5, xmm6 // mov all dot results into xmm5 so that we
movlhps xmm5, xmm4 // can movmskps to recover the sign bits
subps xmm0, xmm7
mov edx, [edi + 8]
shl edx, 4 // each vert is 16 bytes
add edx, t
movaps xmm6, [edx]
movmskps eax, xmm5 // movmskps grabs the msb (sign) from each reg.

subps xmm2, xmm7
subps xmm6, xmm7 // xmm6 = C; xmm7 free after this.

// here we add the current triangle being tested to the result list as if it were an
// intersection. Then, we conditionally increment the result_count based on whether
// there was an intersection. This is all to avoid a branch.

mov edx, [results]
add edx, result_idx
mov ecx, tri_count
mov [edx], ecx
mov edx, result_idx
mov ecx, result_idx
add ecx, 8
and eax, 111b // w-component is meaningless, make sure it doesn't interfere.
cmp eax, 0
cmove edx, ecx
cmp eax, 111b // check for back-facing triangles also
cmove edx, ecx
mov result_idx, edx
add tri_count, 1
sub num_tris, 1
jnz start_loop

// we've already loaded the last tri. Just need to do a last iteration of the loop.

movaps xmm3, xmm1
movaps xmm4, xmm2
shufps xmm3, xmm3, 11001001b // xmm3 : ? Rx Rz Ry
shufps xmm4, xmm4, 11010010b // xmm4 : ? Ay Ax Az
shufps xmm1, xmm1, 11010010b // xmm1 : ? Ry Rx Rz
shufps xmm2, xmm2, 11001001b // xmm2 : ? Ax Az Ay
mulps xmm4, xmm3 // xmm4 = ? RxAy RzAx RyAz
mulps xmm2, xmm1 // xmm2 = ? RyAx RxAz RzAy
xorps xmm7, xmm7 // clear out xmm7 to store -(R X A) later i.e. (A X R)
subps xmm4, xmm2 // nasty dependency here. xmm4 = (R X A)
movaps xmm2, xmm0 // done with xmm2 so get ready for (R X B) later.
subps xmm7, xmm4 // xmm7 now holds (A X R)
mulps xmm4, xmm0 // now compute B * (R X A)
// interlaced, we compute C * (A X R)
// and begin computing C * (R X B),
// R is already primed for cross in xmm3 and xmm1
shufps xmm2, xmm2, 11010010b // xmm2 : ? By Bx Bz
movhlps xmm5, xmm4 // dotting... B * (R X A)
mulps xmm7, xmm6 // xmm7 : C * (A X R)
shufps xmm0, xmm0, 11001001b // xmm0 : ? Bx Bz By
mulps xmm3, xmm2 // xmm3 : ? RxBy RzBx RyBz
addps xmm5, xmm4 // dotting... B * (R X A)
mulps xmm1, xmm0 // xmm1 : ? RyBx RxBz RzBy // xmm0 now free.
movhlps xmm0, xmm7 // dotting... C * (A X R)
shufps xmm4, xmm4, 11000001b // dotting... B * (R X A)
subps xmm3, xmm1 // xmm3 : (R X B)
addss xmm0, xmm7 // dotting... C * (A X R)
mulps xmm3, xmm6 // xmm3 : C * (R X B)
shufps xmm7, xmm7, 11000001b // dotting... C * (A X R)
addss xmm4, xmm5 // xmm4 = dot(B, (R X A))
movhlps xmm2, xmm3 // dotting... C * (R X B)
addss xmm7, xmm0 // xmm7 = dot(C, (A X R))
addss xmm2, xmm3 // xmm0,1,5,6 all unused now.
shufps xmm3, xmm3, 11000001b
addss xmm2, xmm3 // all 3 dots computed.
unpcklps xmm7, xmm2 // trickiness here:
movlhps xmm7, xmm4 // arrange the dots in a register and do a
movmskps eax, xmm7 // movmskps, which grabs the msb (sign) bit from each reg.

mov edx, results
add edx, result_idx
mov ecx, tri_count
mov [edx], ecx
mov edx, result_idx
mov ecx, result_idx
add ecx, 8
and eax, 111b // w-component is meaningless, make sure it doesn't interfere.
cmp eax, 0
cmove edx, ecx
cmp eax, 111b // check for back-facing triangles also
cmove edx, ecx
sar edx, 3 // divide out the sizeof(Ray_tri_result)
mov result_idx, edx
}

return result_idx;
}
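In plain C++ the per-triangle test in that loop boils down to roughly this (scalar reference sketch only; it assumes vec4 has operator- plus cross() and dot() helpers that aren't shown):

// Scalar reference for the signed-volume test done per triangle above.
// Same convention: intersection if all three determinants have the same sign
// (all positive, or all negative for back-facing triangles).
// Assumes vec4 operator- and free cross()/dot() helpers (not shown here).
inline bool ray_hits_tri_scalar(const vec4& r0, const vec4& r1,
                                const vec4& t0, const vec4& t1, const vec4& t2)
{
	vec4 R = r1 - r0, A = t0 - r0, B = t1 - r0, C = t2 - r0;

	float d0 = dot(B, cross(R, A));	// det(R A B)
	float d1 = dot(C, cross(R, B));	// det(R B C)
	float d2 = dot(C, cross(A, R));	// det(R C A)

	return (d0 >= 0.0f && d1 >= 0.0f && d2 >= 0.0f) ||
	       (d0 <= 0.0f && d1 <= 0.0f && d2 <= 0.0f);
}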



and to calculate the value of t for each intersection:

// ray_x_triplane_indexed : results holds an index
// into idxs which represents triangles via indexes of vertices in t.

void ray_x_triplane_indexed_SSE(const vec4& r0,
const vec4& r1,
const vec4* t,
const unsigned int* idxs,
Ray_tri_result* results,
int count)
{
// this function only works for groups because it preloads the next triangle.
// If the count is 0 or 1, we can't use this function.

if(count == 0)
return;

else if(count == 1)
{
ray_x_triplane_indexed(r0, r1, t, idxs, results); // single-triangle fallback, by analogy with ray_x_tri_indexed above (assumed helper)
return;
}

// calculate t along r0r1 at which the ray intersects the plane containing triangle t[0], t[1], t[2]
// mathematically: t = ((r0 - v0) . n) / ((r0 - r1) . n), where n = (v1 - v0) x (v2 - v0)
__asm {
sub count, 1 // since we're preloading verts, need to test one ahead
mov edi, [results]
mov esi, 0Ch
mov eax, [edi]Ray_tri_result.tri_idx
mul esi
mov edx, r1
add eax, idxs
mov esi, r0

mov ecx, [eax]
shl ecx, 4 // each vert is 16 bytes
add ecx, t
movaps xmm7, [ecx] // xmm7 : v0, need to subtract out v0 from each component.
movaps xmm0, [esi] // xmm0 : r0
mov ecx, [eax + 4]
shl ecx, 4 // each vert is 16 bytes
add ecx, t
movaps xmm1, [ecx] // xmm1 : V1
mov ecx, [eax + 8]
shl ecx, 4 // each vert is 16 bytes
add ecx, t
movaps xmm2, [ecx] // xmm2 : V2 ; first compute r0 . (V1 x V2)
subps xmm1, xmm7
subps xmm2, xmm7
movaps xmm3, xmm1
movaps xmm6, [edx] // xmm6 : r1 ; but meanwhile start (r0 - r1)
shufps xmm3, xmm3, 11001001b // xmm3 : ? Bx Bz By
movaps xmm4, xmm2
shufps xmm4, xmm4, 11010010b // xmm4 : ? Cy Cx Cz
shufps xmm1, xmm1, 11010010b // xmm1 : ? By Bx Bz
shufps xmm2, xmm2, 11001001b // xmm2 : ? Cx Cz Cy

start_loop:

add edi, 8 // move to next Ray_tri_result (size is 8 bytes)
movaps xmm5, xmm0 // copy r0 to start processing denominator.
mulps xmm4, xmm3 // xmm4 = ? BxCy BzCx ByCz
subps xmm0, xmm7 // done with xmm7.
subps xmm5, xmm6 // denom : (r0 - r1)
mulps xmm2, xmm1 // xmm2 = ? ByCx BxCz BzCy
mov esi, 0Ch
mov eax, [edi]Ray_tri_result.tri_idx
mul esi
subps xmm4, xmm2 // now have (V1 x V2)
mov edx, r1
mov esi, r0
add eax, idxs
mulps xmm5, xmm4 // starting denominator dot: (r0 - r1) . n
mulps xmm4, xmm0 // starting numerator dot: (r0 - v0) . n
mov ecx, [eax]
shl ecx, 4 // each vert is 16 bytes
add ecx, t
movaps xmm7, [ecx] // xmm7 : v0, need to subtract out v0 from each component.
movaps xmm0, [esi] // xmm0 : r0
movhlps xmm6, xmm5
movhlps xmm3, xmm4
mov ecx, [eax + 4]
shl ecx, 4 // each vert is 16 bytes
add ecx, t
movaps xmm1, [ecx] // xmm1 : V1
mov ecx, [eax + 8]
shl ecx, 4 // each vert is 16 bytes
add ecx, t
movaps xmm2, [ecx] // xmm2 : V2 ; first compute r0 . (V1 x V2)
addss xmm6, xmm5 // still dotting...
addss xmm3, xmm4 // still dotting...
shufps xmm5, xmm5, 11000001b
shufps xmm4, xmm4, 11000001b
addss xmm5, xmm6
subps xmm1, xmm7
movss xmm6, xmm4
subps xmm2, xmm7
addss xmm6, xmm3
movaps xmm3, xmm1
movaps xmm4, xmm2
divss xmm6, xmm5
shufps xmm1, xmm1, 11010010b // xmm1 : ? By Bx Bz
shufps xmm2, xmm2, 11001001b // xmm2 : ? Cx Cz Cy
shufps xmm3, xmm3, 11001001b // xmm3 : ? Bx Bz By
shufps xmm4, xmm4, 11010010b // xmm4 : ? Cy Cx Cz
movss [edi-8]Ray_tri_result.t, xmm6
movaps xmm6, [edx] // xmm6 : r1 ; but meanwhile start (r0 - r1)

sub count, 1
jnz start_loop

// now process the last triangle, which is already loaded.

movaps xmm5, xmm0 // copy r0 to start processing denominator.
mulps xmm4, xmm3 // xmm4 = ? BxCy BzCx ByCz
subps xmm0, xmm7 // done with xmm7.
subps xmm5, xmm6 // denom : (r0 - r1)
mulps xmm2, xmm1 // xmm2 = ? ByCx BxCz BzCy
subps xmm4, xmm2 // now have (V1 x V2)
mulps xmm5, xmm4 // starting denominator dot: (r0 - r1) . n
mulps xmm4, xmm0 // starting numerator dot: (r0 - v0) . n
movhlps xmm6, xmm5
movhlps xmm3, xmm4
addss xmm6, xmm5
addps xmm3, xmm4
shufps xmm5, xmm5, 11000001b
shufps xmm4, xmm4, 11000001b
addss xmm5, xmm6
addss xmm4, xmm3
divss xmm4, xmm5
movss [edi]Ray_tri_result.t, xmm4

}
}
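For reference, the scalar math done per result in that loop is just this (sketch; same assumed vec4 helpers as before, and no divide-by-zero guard, exactly like the asm):

// Scalar reference for the t computation above.
// n = (v1 - v0) x (v2 - v0) is the triangle's plane normal.
inline float ray_plane_t_scalar(const vec4& r0, const vec4& r1,
                                const vec4& v0, const vec4& v1, const vec4& v2)
{
	vec4 n = cross(v1 - v0, v2 - v0);
	return dot(r0 - v0, n) / dot(r0 - r1, n);
}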





Damn, gamedev doesn't even warn you when you've changed your password and it gets filled in wrong!

Anyway, cut'n'paste error on

ray_x_triplane_indexed_SSE(r0, r1, &m_verts[0], &m_idxs[start_idx], results, count);

should be:

ray_x_triplane_indexed_SSE(r0, r1, &m_verts[0], &m_idxs[0], results, count);

Quote:
Hmm, maybe. Is there a better way?

Yes, normally you could obtain the pointers in registers as parameters via the __fastcall calling convention. Alas, the big problem is that the compiler has bugs when you mix __fastcall and inline asm as soon as VCC decides to inline the function. It loses track of the register roles and you get anything in the registers (which should contain the pointers to your vectors here). That's why, for small functions, intrinsics are the only decent solution... with VCC.


Quote:
I think that in the end, the first 2 movs are dispatched together, and as soon as eax is calculated, the movaps starts.

True, but an AGI stall is something else: a stall in the address-generation and memory-access stage of the execution pipeline. Complex superscalar processors (the P3, for instance) need an address to be known far ahead of a read or write (and also of a lea). There are so many things coming into play: cache prediction, alignment and boundary checks... Well, I don't know the reasons in detail, but I know the result from a coder's POV: a stall. Your pointers are filled just before they are used for dereferencing (mov ..., [eax]), which is bad. So be conscious of this when you work with indexed arrays, for instance. Try to unroll and have the indices ready a bit more in advance. You can combine that with prefetching strategies. It's a bit like clutching at straws, but that's asm...
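Roughly the shape I mean, with intrinsics (sketch only; the names and data layout are mine): compute the next iteration's address early and prefetch it while the current element is still being worked on.

#include <xmmintrin.h>

// Sketch: know the next triangle's address one iteration ahead of its use,
// and prefetch it so the loads don't stall the address-generation stage.
void process_tris(const vec4* verts, const unsigned int* idxs, int num_tris)
{
	const vec4* cur = &verts[idxs[0]];	// first vertex of triangle 0
	for (int i = 0; i < num_tris; ++i)
	{
		const vec4* next = cur;
		if (i + 1 < num_tris)
		{
			next = &verts[idxs[(i + 1) * 3]];	// address known well before it is dereferenced
			_mm_prefetch((const char*)next, _MM_HINT_T0);
		}

		// ... test the triangle whose first vertex is 'cur' here ...
		(void)cur;	// placeholder for the real work

		cur = next;
	}
}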


Quote:
Sometimes inlined, sometimes not. I never really know unless I look.

I mentioned this because, compared to the order of magnitude of your function timings (43 cycles), the call overhead might be very significant, probably around 5-10 cycles. And that could make the comparison you made (_ps vs _ss) quite noisy, because of the overall scheduling.


Quote:
Many times I return values via non-const reference... I'll have to check if that affects inlining.

No, it should not. I see no reason it would, and I clearly remember such functions being inlined. I'd even add that some smart compilers may notice that if you reference a local variable (normally allocated on the stack) as output, it need not be put on the stack at all, but can be kept in a register. Visual C++ is capable of this in some cases. Try writing an add_4f(&r, &s, &t) function based on the intrinsic, for instance; r might not be saved on the stack. But none of this is possible if you use inline asm (e.g. movss result, xmm4), since VCC does not touch the asm code. Another reason to use intrinsics.
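That add_4f is just this with the intrinsics (sketch; assumes vec4 is 16-byte aligned):

#include <xmmintrin.h>

// add_4f via the intrinsic: VCC is free to keep *r in a register if the
// caller only feeds it to further SSE ops. Assumes 16-byte aligned vec4.
inline void add_4f(vec4* r, const vec4* s, const vec4* t)
{
	_mm_store_ps((float*)r,
	             _mm_add_ps(_mm_load_ps((const float*)s),
	                        _mm_load_ps((const float*)t)));
}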

Quote:
Have you had good luck with intrinsics?

With VCC and 3DNow, yes. With SSE, so far yes, but I have not tested as many things as with 3DNow. Normally VCC does not change the instruction order too much, though it does a little. In practice you don't lose too many cycles because of the reordering capacities of the CPUs (FPUs). I'd say at worst the 'negative contribution' of VCC is 10-20% on this issue.


Quote:
Also some of the VC++ intrinsics evaluate to function calls that don't inline! _mm_cvtps_pi16 for example.

I don't have my docs on this PC, and I don't remember that particular instruction well. But maybe it's because the instruction is emulated for the minimal target processor you selected in the compiler settings.


Quote:
If you notice, the SIMD version only uses 5 registers while the scalar version uses all 8.

Sure, this might count on a bigger scale. I'll give your routine a try and rewrite it. But from the benchmarks I remember, your final count of 43 cycles seems very high to me. I remember quaternion muls of 16-27 cycles in various SIMD (3DNow/SSE/Altivec) / compiler (VCC/GCC) combinations, which is something comparable, probably even more complex. That's why I suggested some avenues for potential improvement.


Quote:
If you look at Sony's VUs, you can swizzle vec4 components within the instruction AND they have proper MADD. I wish they'd done that with SSE.

Yes. In creating VSIMD (Virtual SIMD: a multiplatform library) (*) I noticed that 3DNow could have been very efficient if a perp_2f instruction had been wired in: it's (x,y) -> (-y,x), and with it you can define a cross product in a few ops. For SSE, I think they should have wired in an acc_4f that fills the vector with (x+y+z+w). But they preferred to sell their SoA strategy, which I find a pity. Take collision detection or geometric algorithms: they most often require handling primitives (vertices, triangles, etc.) one by one. So the cross and dot products should have been given far more consideration.
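With plain SSE1 an acc_4f has to be emulated with shuffles, which is exactly the long horizontal chain we are complaining about (sketch):

#include <xmmintrin.h>

// Emulated acc_4f: returns (x+y+z+w) splatted into all four lanes.
inline __m128 acc_4f(__m128 v)
{
	__m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v));	// lane 0 = x+z, lane 1 = y+w
	t = _mm_add_ss(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 1, 1, 1)));	// x+y+z+w in lane 0
	return _mm_shuffle_ps(t, t, _MM_SHUFFLE(0, 0, 0, 0));	// splat to all lanes
}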

For your Ray/Tri routine I don't have time to take a detailed look at it atm. But I'll come back to it later, if I see anything relevant to comment.

(*) To anyone who's seen posts about it and might wonder "WTF are you doing about it?", rest assured it's not a dead project. It's simply that things have been quite hellish in my private life recently, so it's on stand-by... till I can finish what I have in mind.

Quote:
Original post by Charles B
I know the result from a coder's POV: a stall. Your pointers are filled just before they are used for dereferencing (mov ..., [eax])
...
So be conscious of this when you work with indexed arrays, for instance.

Okay, right. I just meant that you only pay for one stall (eax), and you get the others for free... and I didn't know any way to eliminate that one. As for indexed arrays, you're going to really hate my ray-tri code then. The reality is that it ends up only being about 4 cycles (out of 64) slower than the non-indexed triangles code. Calculating the exact memory address to load and then loading each next vertex gets completely rescheduled by the processor, so it just ends up running in stall cycles while the current tri is tested. The three instructions are all dispatched in one cycle, and I didn't know a better way to do it.

Quote:

(regarding intrinsics...)
Normally VCC does not change the instruction order too much, still it does. But in practice you don't loose too much cycles because of the reordering capacities of the CPUs (FPUs).

My worry is not about instruction re-ordering, but about register economy. I write many routines so that all SSE registers are saturated all the time... as soon as a register frees up, I want to use it either to start processing the next op or to start preloading the next piece of data for the next loop iteration. I'm not sure that I trust VC to do it best... although you're right that in the past maybe I've had my project settings not set properly to let the compiler do this.

Quote:

Sure, this might count on a bigger scale. I'll give a try to your routine, rewritting it.
...
For your Ray/Tri routine I don't have time to take a detailed look at it atm. But I'll come back to it later, if I see anything relevant to comment.

If you put determinants in a loop, you could optimize it very well... the movaps and swizzles could be done for free during the huge dot dependency chain, especially because the dot occurs in only 2 registers.

The ray-tri code... I didn't mean to include the whole damn thing. I just meant to say, "I understand that using fewer registers is very useful in real circumstances. Here's how I know! (Within 5 instructions it's interleaving a cross product, 2 dot products, and loading the next triangle's vertices.)" But I thought the snippet alone would be meaningless without context, so I just pasted the whole thing :)


Quote:

(*) To anyone who's seen posts about it, an might wonder "WTF r u doing about it ?", just be sure it's not a dead project. Simply it's been quite a hell in my private life recently so in stand by ... till I can finish what I have in mind.

Been wondering. I've been thinking about writing an SSE math syntax that could be run by a separate (custom) pre-processor and turned into asm SSE code. Not focusing on optimal scheduling but register saturation (the processor does such phenomenal work scheduling/re-ordering instructions, it seems the best bet is just to try to prevent redundant loads/stores)... but then it's a huge undertaking and it seems others are further along.

Quote:
Original post by ajas95
As for indexed arrays, you're going to really hate my ray-tri code then. The reality is that it ends up only being about 4 cycles (out of 64) slower than the non-indexed triangles code.

Indexed arrays spare memory and cache from an algorithmic POV, and they help in other ways too. So they're nice and practical, no question about it.

But since you need an unroll x2 to hide the dot-product chain (cf. later), it will be easy to anticipate the address (index) loading. Even if the CPU's reordering might help, this can only be better, or equal at worst.

Quote:

I'm not sure that I trust VC to do it best... although you're right that in the past maybe I've had my project settings not set properly to let the compiler do this.

There are many, many small details to control for each compiler. It's crazy. I have spent a lot of time on apparently pointless stuff which in the end changes the benchmarks considerably, sometimes with >> 100% speed-up factors. But it's a very clumsy job, highly compiler dependent, even version dependent, and it gives an artificial complexity (wizardry? ugliness?) to my HAL (cf. later). I'll come back to the issue of register pressure later. Still, this ugliness is mostly isolated in the few files of the compile-time "drivers" below my HAL.

Quote:

the movaps and swizzles could be done for free during the huge dot dependency chain, especially because the dot occurs in only 2 registers.

That's exactly what you can do with an unroll factor of 2, interleaving two "threads" (in the generic sense). That way you need roughly half of the registers (register economy) for each "thread". I see that you know what to do in broad strokes.


Quote:

Been wondering. I've been thinking about writing an SSE math syntax that could be run by a separate (custom) pre-processor and turned into asm SSE code.

I have heard of C--, but it does not seem to be entirely in that direction. There are some other things too. Otherwise yes, I also imagined a preprocessor could be an elegant solution. But there is so much work to do; I am too lazy and isolated to do that alone, plus I lack experience in writing compiler technology. I prefer to rely on the prediction that compilers will improve SIMD code generation in the future, which seems to be happening, even if slowly (GCC for instance). So my job is basically writing an abstraction layer that takes and standardizes the common or equivalent features of the various instruction sets (3DNow, SSE1, SSE2, Altivec), relying mostly on intrinsics. This already requires a good deal of analysis, synthesis and creativity. And this experience might serve to build a more detailed preprocessing or compiling tool... if necessary.
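The shape of that layer, very roughly (all names invented for illustration):

// Compile-time "driver" selection below a thin, uniform API (sketch only).
#if defined(VSIMD_USE_SSE)
	#include <xmmintrin.h>
	typedef __m128 vfloat4;
	static inline vfloat4 vadd_4f(vfloat4 a, vfloat4 b) { return _mm_add_ps(a, b); }
	static inline vfloat4 vmul_4f(vfloat4 a, vfloat4 b) { return _mm_mul_ps(a, b); }
#else
	// scalar fallback so the same code builds everywhere
	struct vfloat4 { float x, y, z, w; };
	static inline vfloat4 vadd_4f(vfloat4 a, vfloat4 b)
	{ vfloat4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w }; return r; }
	static inline vfloat4 vmul_4f(vfloat4 a, vfloat4 b)
	{ vfloat4 r = { a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w }; return r; }
#endif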

Quote:

Not focusing on optimal scheduling but register saturation (the processor does such phenomenal work scheduling/re-ordering instructions, it seems the best bet is just to try to prevent redundant loads/stores)... but then it's a huge undertaking and it seems others are further along.

All in all, you can be confident that the compiler will not blow up the register allocation you suggest to it, as long as you don't push it too hard. The most common dead end is:
- If the allocation is so tight that you use (reg, mem) variations. VCC does not respect your decision and splits them into a load and an operate, which wastes one register and destroys your nice scheduling puzzle. GCC respects it since I replaced the default intrinsics with "intelligent" inline asm.

As a consequence, try to find decent implementations where every intermediate variable is explicitly loaded, and leave some room for the compiler's errors of strategy. I mean, if you have a margin of one or two registers, you can be sure the compiler won't break the puzzle when it comes into play. For instance, if you explicitly use 7 "registers" and load explicitly, no problem. If you use 6 "registers" and (reg, mem) ops, it should also work.
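Concretely, the difference I mean (sketch; pointer names invented):

#include <xmmintrin.h>

// Explicit loads: the "register" count you wrote is the one the compiler sees.
inline __m128 sum_explicit(const float* pa, const float* pb)
{
	__m128 a = _mm_load_ps(pa);
	__m128 b = _mm_load_ps(pb);
	return _mm_add_ps(a, b);
}

// (reg, mem) style: relies on the compiler folding the memory operand.
// VCC may split it into a separate load anyway, costing an extra register.
inline __m128 sum_folded(const float* pa, const float* pb)
{
	return _mm_add_ps(_mm_load_ps(pa), *(const __m128*)pb);
}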


