demonstrations of SSE / 3DNow


I know that SSE and 3DNow extensions are one helluva buzz right now; everyone (including me) is writing their own multi-codepath vector libs. Is there any "neutral" third-party application out there that demonstrates a realtime benchmark using the different extensions? EDIT: ouch - prolly the wrong forum

I can only speak from the things I've written. There is no advantage to 3dNow. It was a decent first try many years ago, but SSE is much more widespread now.

But if you're going to try to write a vector lib overloading operators and such with SSE, don't bother. SSE can, if written well, give you big speed-ups when you apply it to algorithms as a whole, or operations with sqrts/rsqrts/divs. But if you think that some compiler magic will happen when you call a series of separate SSE asm blocks, you're mistaken. You'll spend so much time moving data in and out of registers that your code will probably get slower!
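To make the point concrete, here is a minimal sketch (my own illustration, not code from ajas95's libs; `vec4w` and `sum3_batch` are made-up names) contrasting the two styles: an overloaded operator must round-trip through memory on every single call, while a loop written around the whole operation keeps intermediates in registers.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>

// Per-operation wrapper: every '+' does load -> add -> store.
// Chaining a + b + c pays the memory round-trip twice.
struct vec4w {
    float f[4];
    vec4w operator+(const vec4w& o) const {
        vec4w r;
        _mm_storeu_ps(r.f, _mm_add_ps(_mm_loadu_ps(f), _mm_loadu_ps(o.f)));
        return r;
    }
};

// Whole-algorithm version: the intermediate sum stays in a register
// across the chain; only one store per group of four floats.
void sum3_batch(const float* a, const float* b, const float* c,
                float* out, int count4 /* number of 4-float groups */)
{
    for (int i = 0; i < count4; ++i) {
        __m128 s = _mm_add_ps(_mm_loadu_ps(a + 4*i), _mm_loadu_ps(b + 4*i));
        s = _mm_add_ps(s, _mm_loadu_ps(c + 4*i));   // no intermediate store
        _mm_storeu_ps(out + 4*i, s);
    }
}
```

Both compute the same thing; the difference is purely in how often data leaves the registers, which is exactly where the wrapper approach loses.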

Compared to FPU code, a whole algorithm written with SSE commonly gets a 2x speed-up, sometimes up to 2.5x.

Quote:
Original post by ajas95
I can only speak from the things I've written. There is no advantage to 3dNow. It was a decent first try many years ago, but SSE is much more widespread now.

But if you're going to try to write a vector lib overloading operators and such with SSE, don't bother. SSE can, if written well, give you big speed-ups when you apply it to algorithms as a whole, or operations with sqrts/rsqrts/divs. But if you think that some compiler magic will happen when you call a series of separate SSE asm blocks, you're mistaken. You'll spend so much time moving data in and out of registers that your code will probably get slower!

Compared to FPU code, a whole algorithm written with SSE commonly gets a 2x speed-up, sometimes up to 2.5x.

That's just what I said: the per-operation SSE "optimizations" for vector libs are pretty useless. That's why I asked for some other demonstrations, like cloth / fluid or other physical simulations :)

I've written a bunch of SSE demos. Cloth, vehicles, fishies schooling. Anyway, I just wrote an SSE vector normalize routine that processes batches of vectors. CodeAnalyst tells me it runs in around 45 cycles per 4 vectors, so about 11 cycles each. Not too shabby :) I'd expect an FPU version to run in around 50 cycles each... but this is really the ideal case for SSE; in general a 5x speed-up is not possible.

#include <xmmintrin.h>   // for __m128, and if you want to use intrinsics.

struct vec4 {
	union {
		struct { float x, y, z, w; };
		__m128 v;
	};
};

void normalize4_SSE( vec4* vecs, int count )
{
	__asm {
		mov      eax, vecs
		mov      ecx, count
		shr      ecx, 2
		shl      ecx, 6            // sizeof(vec4) * ((count >> 2) << 2), i.e. divisible by 4.
		add      ecx, eax
		mov      edx, eax

		// first compute a transpose
		movaps   xmm5, [eax]
		movaps   xmm7, [eax + 20h]
		movaps   xmm4, [eax]
		movaps   xmm6, [eax + 20h]
		unpckhps xmm5, [eax + 10h]
		unpckhps xmm7, [eax + 30h]
		unpcklps xmm4, [eax + 10h]
		unpcklps xmm6, [eax + 30h]
		// slightly bizarre way of packing a 4x4 matrix transpose into 4 registers.
		// Fancy footwork not really necessary here, but it will be in the loop.
		movlhps  xmm5, xmm7
		movaps   xmm7, xmm4
		movlhps  xmm4, xmm6
		movhlps  xmm6, xmm7

	norm_loop:
		add      eax, 40h
		// all 4 dot products at once here.
		mulps    xmm4, xmm4
		mulps    xmm5, xmm5
		mulps    xmm6, xmm6
		addps    xmm4, xmm5
		addps    xmm6, xmm4
		// break up the dependency chain by starting the next transpose in xmm4-7.
		// From now on, the current normalize happens in xmm0-xmm3,
		// while the next transpose is in xmm4-7.
		movaps   xmm4, [eax]
		movaps   xmm5, [eax]
		movaps   xmm7, [eax + 20h]
		rsqrtps  xmm0, xmm6        // xmm6 free at last!  Thank god almighty...
		movaps   xmm6, [eax + 20h]
		unpcklps xmm4, [eax + 10h]
		unpckhps xmm5, [eax + 10h]
		unpcklps xmm6, [eax + 30h]
		unpckhps xmm7, [eax + 30h]
		shufps   xmm3, xmm0, 11111111b
		shufps   xmm2, xmm0, 10101010b
		shufps   xmm1, xmm0, 01010101b
		shufps   xmm0, xmm0, 00000000b
		movhlps  xmm3, xmm3
		movhlps  xmm2, xmm2
		movhlps  xmm1, xmm1
		movlhps  xmm5, xmm7
		movaps   xmm7, xmm4
		// possible speed-up by intermixing the mov*s with mul*s to try to keep
		// the different processing units busy.  Makes the code messy though.
		mulps    xmm3, [edx + 30h]
		mulps    xmm2, [edx + 20h]
		mulps    xmm1, [edx + 10h]
		mulps    xmm0, [edx]
		movlhps  xmm4, xmm6
		movhlps  xmm6, xmm7
		movaps   [edx + 30h], xmm3
		movaps   [edx + 20h], xmm2
		movaps   [edx + 10h], xmm1
		movaps   [edx      ], xmm0
		add      edx, 40h
		cmp      eax, ecx
		jne      norm_loop
	}
}

Note that this version reads one group of four vec4s beyond the end of the array, and only works when count % 4 == 0. Of course, that's all easy to fix; this way just keeps the code most compact.
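One way to handle the "easy to fix" part: run the batch routine on the largest multiple-of-4 prefix, then finish the leftover vectors with scalar code. A hedged sketch (my own; `normalize_all` and `normalize_scalar` are made-up names, and a scalar stand-in is used in place of the asm routine so the sketch runs anywhere):

```cpp
#include <cmath>
#include <cassert>

struct vec4 { float x, y, z, w; };

// Plain scalar normalize by 3D length (w scaled along, like the asm version).
static void normalize_scalar(vec4* v)
{
    float r = 1.0f / std::sqrt(v->x*v->x + v->y*v->y + v->z*v->z);
    v->x *= r; v->y *= r; v->z *= r; v->w *= r;
}

void normalize_all(vec4* vecs, int count)
{
    int batched = count & ~3;           // largest multiple of 4 <= count
    // normalize4_SSE(vecs, batched);   // fast path from the post would go here
    for (int i = 0; i < batched; ++i)   // scalar stand-in for the fast path
        normalize_scalar(vecs + i);
    for (int i = batched; i < count; ++i)
        normalize_scalar(vecs + i);     // scalar tail: no over-read past the end
}
```

Sizing the batch with `count & ~3` and stopping the loop there also avoids the one-group over-read; the real fast path additionally needs the array 16-byte aligned, since movaps faults on unaligned addresses.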
