Metus

demonstrations of SSE / 3DNow


I know that the SSE and 3DNow extensions are one helluva buzz right now; everyone (including me) is writing their own multi-codepath vector libs. Is there any "neutral" third-party application out there that can demonstrate a realtime benchmark using the different extensions? EDIT: ouch - probably the wrong forum

I can only speak from the things I've written: there is no advantage to 3DNow. It was a decent first try many years ago, but SSE is much more widespread now.

But if you're going to try to write a vector lib overloading operators and such with SSE, don't bother. SSE can, if written well, give you big speed-ups when you apply it to an algorithm as a whole, or to operations with sqrts/rsqrts/divs. But if you think that some compiler magic will happen when you call a series of separate SSE asm blocks, you're mistaken. You'll spend so much time moving data in and out of registers that your code will probably get slower!
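To illustrate the point, here's a minimal sketch (names and code are mine, not from this thread) contrasting the two styles: a per-operator wrapper forces the data through memory between every operation, while the whole-algorithm style keeps intermediates in __m128 registers.

```cpp
#include <xmmintrin.h>
#include <cassert>

struct alignas(16) v4 { float f[4]; };

// "Operator style": every call loads operands from memory and stores the
// result back, so data round-trips through memory between each operation.
inline v4 add(const v4& a, const v4& b)
{
    v4 r;
    _mm_store_ps(r.f, _mm_add_ps(_mm_load_ps(a.f), _mm_load_ps(b.f)));
    return r;
}

// "Whole algorithm" style: the intermediate sum stays in an __m128
// register; memory is touched only at the edges of the computation.
inline void add3_batched(const v4& a, const v4& b, const v4& c, v4& out)
{
    __m128 t = _mm_add_ps(_mm_load_ps(a.f), _mm_load_ps(b.f));
    t = _mm_add_ps(t, _mm_load_ps(c.f));   // no intermediate store/load
    _mm_store_ps(out.f, t);
}
```

Both compute the same thing; the difference only shows up in the generated loads and stores, which is exactly the overhead being described above.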

A whole algorithm written with SSE commonly gets a 2x speed-up over FPU code, sometimes up to 2.5x.

Quote:
Original post by ajas95
I can only speak from the things I've written: there is no advantage to 3DNow. It was a decent first try many years ago, but SSE is much more widespread now.

But if you're going to try to write a vector lib overloading operators and such with SSE, don't bother. SSE can, if written well, give you big speed-ups when you apply it to an algorithm as a whole, or to operations with sqrts/rsqrts/divs. But if you think that some compiler magic will happen when you call a series of separate SSE asm blocks, you're mistaken. You'll spend so much time moving data in and out of registers that your code will probably get slower!

A whole algorithm written with SSE commonly gets a 2x speed-up over FPU code, sometimes up to 2.5x.


That's just what I said - SSE "optimizations" for vector libs are pretty useless. That's why I asked for some other demonstrations, like cloth / fluid or other physical simulations :)

I've written a bunch of SSE demos: cloth, vehicles, fish schooling. Anyway, I just wrote an SSE vector normalize routine that processes batches of vectors. CodeAnalyst tells me it runs in around 45 cycles per 4 vectors, so about 11 cycles each. Not too shabby :) I'd expect an FPU version to run in around 50 cycles each... but this is really the ideal case for SSE; don't expect a 5x speed-up in general.



#include <xmmintrin.h> // for __m128, and if you want to use intrinsics.

struct vec4 {
union {
struct { float x, y, z, w; };
__m128 v;
};
};

void normalize4_SSE( vec4* vecs, int count )
{
__asm {

mov eax, vecs
mov ecx, count
shr ecx, 2
shl ecx, 6 // sizeof(vec4) * ((count >> 2) << 2) i.e. divisible by 4.
add ecx, eax
mov edx, eax

// first compute a transpose

movaps xmm5, [eax]
movaps xmm7, [eax + 20h]
movaps xmm4, [eax]
movaps xmm6, [eax + 20h]

unpckhps xmm5, [eax + 10h]
unpckhps xmm7, [eax + 30h]
unpcklps xmm4, [eax + 10h]
unpcklps xmm6, [eax + 30h]

// slightly bizarre way of packing a 4x4 matrix transpose into 4 registers.
// Fancy footwork not really necessary here, but it will be in the loop.

movlhps xmm5, xmm7
movaps xmm7, xmm4
movlhps xmm4, xmm6
movhlps xmm6, xmm7

norm_loop:

add eax, 40h

// all 4 dot products at once here.

mulps xmm4, xmm4
mulps xmm5, xmm5
mulps xmm6, xmm6

addps xmm4, xmm5
addps xmm6, xmm4

// break up the dependency chain by starting next transpose in xmm4-7
// From now on, the current normalize happens in xmm0-xmm3,
// While the next transpose is in xmm4-7

movaps xmm4, [eax]
movaps xmm5, [eax]
movaps xmm7, [eax + 20h]

rsqrtps xmm0, xmm6 // xmm6 free at last! Thank god almighty...
movaps xmm6, [eax + 20h]

unpcklps xmm4, [eax + 10h]
unpckhps xmm5, [eax + 10h]
unpcklps xmm6, [eax + 30h]
unpckhps xmm7, [eax + 30h]

shufps xmm3, xmm0, 11111111b
shufps xmm2, xmm0, 10101010b
shufps xmm1, xmm0, 01010101b
shufps xmm0, xmm0, 00000000b

movhlps xmm3, xmm3
movhlps xmm2, xmm2
movhlps xmm1, xmm1

movlhps xmm5, xmm7
movaps xmm7, xmm4

// possible speed-up by intermixing the mov*s with mul*s to try to keep
// the different processing units busy. Makes the code messy, though.

mulps xmm3, [edx + 30h]
mulps xmm2, [edx + 20h]
mulps xmm1, [edx + 10h]
mulps xmm0, [edx]

movlhps xmm4, xmm6
movhlps xmm6, xmm7

movaps [edx + 30h], xmm3
movaps [edx + 20h], xmm2
movaps [edx + 10h], xmm1
movaps [edx ], xmm0

add edx, 40h
cmp eax, ecx

jne norm_loop

}
}
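One caveat worth knowing about the rsqrtps used above: it's only accurate to roughly 12 bits. If you need near-full-precision normals, the usual trick is one Newton-Raphson refinement step. A small intrinsics sketch (my addition, not part of the routine above):

```cpp
#include <xmmintrin.h>
#include <cassert>
#include <cmath>

// One Newton-Raphson refinement of rsqrtps: takes the ~12-bit estimate
// to near full single precision. x holds the squared lengths.
static inline __m128 rsqrt_nr(__m128 x)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 y = _mm_rsqrt_ps(x);                      // ~12-bit estimate
    // y' = 0.5 * y * (3 - x*y*y)
    return _mm_mul_ps(_mm_mul_ps(half, y),
                      _mm_sub_ps(three, _mm_mul_ps(x, _mm_mul_ps(y, y))));
}
```

The extra mulps/subps cost a few cycles per batch, so it's a precision/speed trade-off you make per use case.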





Note that this version reads one iteration beyond 'count' vec4s, and only works when count % 4 == 0. Of course, that's all easy to fix; this way just keeps the code most compact.
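For what it's worth, one way to handle a count that isn't a multiple of 4 is to run the batched routine on the largest multiple of 4 and scalar-normalize the leftovers. A hypothetical tail helper (mine, not from the post; it assumes the batched routine has been fixed to not over-read):

```cpp
#include <cassert>
#include <cmath>

struct vec4 { float x, y, z, w; };

// Scalar-normalize vectors [begin, count) - the 0-3 leftovers after the
// batched SSE routine has handled the largest multiple of 4.
void normalize_tail(vec4* vecs, int begin, int count)
{
    for (int i = begin; i < count; ++i) {
        vec4& v = vecs[i];
        float inv = 1.0f / std::sqrt(v.x*v.x + v.y*v.y + v.z*v.z + v.w*v.w);
        v.x *= inv; v.y *= inv; v.z *= inv; v.w *= inv;
    }
}
```

Usage would be something like: normalize4_SSE(vecs, count & ~3) followed by normalize_tail(vecs, count & ~3, count).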

