Normalization Approximation in 30 operations

Started by
26 comments, last by Chris_F 12 years, 2 months ago

Just delete the post since the community seems to think it's useless.

It's kind of important that you realize why they might think that. From a purely mathematical standpoint, you should be starting in 2D, not 3D. Ideally, any normalization algorithm that works in 2D will extend to 3D, or so I believe.

Also, here's a paste of your code running on 100 random vectors. Your algorithm doesn't work at all.
What's wrong with SSE?

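__declspec(noinline) void normalize(__m128 &col0) { // not inlined, for testing purposes
    // Square each lane: (x*x, y*y, z*z, w*w)
    __m128 dot = _mm_mul_ps(col0, col0);
    // Add neighbouring lanes: (xx+yy, xx+yy, zz+ww, zz+ww)
    dot = _mm_add_ps(dot, _mm_shuffle_ps(dot, dot, _MM_SHUFFLE(2, 3, 0, 1)));
    // Add the two halves so every lane holds the full sum, then divide by its square root
    col0 = _mm_div_ps(col0, _mm_sqrt_ps(_mm_add_ps(dot, _mm_shuffle_ps(dot, dot, _MM_SHUFFLE(0, 0, 3, 3)))));
}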


The original topic is about normalization in 30 operations; according to the disassembly, this version is just 12 instructions:

__declspec(noinline) void normalize(__m128 &col0) {
__m128 dot = _mm_mul_ps(col0, col0);
012D1340 movaps xmm1,xmmword ptr [eax]
012D1343 movaps xmm2,xmm1
012D1346 mulps xmm2,xmm1
dot = _mm_add_ps(dot, _mm_shuffle_ps(dot, dot, _MM_SHUFFLE(2, 3, 0, 1)));
012D1349 movaps xmm0,xmm2
012D134C shufps xmm0,xmm2,0B1h
012D1350 addps xmm0,xmm2
col0 = _mm_div_ps(col0, _mm_sqrt_ps(_mm_add_ps(dot, _mm_shuffle_ps(dot, dot, _MM_SHUFFLE(0, 0, 3, 3)))));
012D1353 movaps xmm2,xmm0
012D1356 shufps xmm2,xmm0,0Fh
012D135A addps xmm2,xmm0
012D135D sqrtps xmm0,xmm2
012D1360 divps xmm1,xmm0
012D1363 movaps xmmword ptr [eax],xmm1
}


If we exclude the first and last instructions (which load the data from memory and store it back), the actual calculation is just 10 operations. Accuracy isn't an issue either, since sqrtps and divps are full-precision instructions.
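For anyone who wants to check the accuracy claim, here is a minimal sketch that compares the shuffle-based routine against a plain scalar reference. It is not from the thread: `normalize_sse` is the same routine rewritten without the MSVC-only `__declspec`, and `max_abs_error` is a hypothetical helper name. It assumes an x86 target and a non-zero input vector.

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h> // SSE intrinsics

// Same shuffle-based routine as above, returning the result by value.
inline __m128 normalize_sse(__m128 v) {
    __m128 sq = _mm_mul_ps(v, v); // (x*x, y*y, z*z, w*w)
    __m128 pairs = _mm_add_ps(sq, _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 3, 0, 1)));
    __m128 sum = _mm_add_ps(pairs, _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(0, 0, 3, 3)));
    return _mm_div_ps(v, _mm_sqrt_ps(sum)); // every lane of sum holds the squared length
}

// Hypothetical helper: largest per-component difference between the SSE
// result and a scalar reference for the 4-component vector (x, y, z, w).
inline float max_abs_error(float x, float y, float z, float w) {
    alignas(16) float out[4];
    _mm_store_ps(out, normalize_sse(_mm_set_ps(w, z, y, x))); // lane 0 is x

    const float len = std::sqrt(x * x + y * y + z * z + w * w);
    assert(len > 0.0f); // a zero vector cannot be normalized
    const float ref[4] = {x / len, y / len, z / len, w / len};

    float err = 0.0f;
    for (int i = 0; i < 4; ++i)
        err = std::fmax(err, std::fabs(out[i] - ref[i]));
    return err;
}
```

Calling `max_abs_error(1.0f, 2.0f, 3.0f, 4.0f)` should give a value on the order of single-precision rounding error, because the only difference from the scalar path is the order in which the squares are summed.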
Quick hack-up in SSE4:


movaps xmm0,xmmword ptr [Vector]
movaps xmm1,xmm0
dpps xmm0,xmm0,CCh
rsqrtps xmm0,xmm0
mulps xmm1,xmm0
movaps xmmword ptr [Vector],xmm1


Not tested, and I'm not 100% sure the immediate value for dpps is right.
Latest project: Sideways Racing on the iPad
This has to be a troll at this point, right? Please let this be a troll...

Not tested, and I'm not 100% sure the immediate value for dpps is right.


Should be right. I didn't know there was a dot-product instruction. Sadly, my CPU doesn't have SSE4.





This has to be a troll at this point, right? Please let this be a troll...


No, I'm not trolling. I'm just wondering why you'd use the FPU when there are SSE versions. With the dot-product instruction Tachikoma suggested it's probably even easier; it can fit in one line:

__m128 normalize(const __m128 vector) {
    // On second thought the mask should be 0xFF, not 0xCC:
    // 0xFF multiplies all four lanes and writes the sum to all four lanes.
    return _mm_div_ps(vector, _mm_sqrt_ps(_mm_dp_ps(vector, vector, 0xFF)));
}
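Since the dpps immediate caused some doubt above, here is a small scalar model of how the mask is interpreted: bits 4 to 7 choose which lane products go into the sum, and bits 0 to 3 choose which output lanes receive it (the rest are zeroed). `dp_ps_model` is a hypothetical name used only for illustration, and the model does not reproduce the hardware's exact summation order, so the last bit of rounding may differ.

```cpp
#include <array>
#include <cstdint>

// Scalar model of the DPPS / _mm_dp_ps immediate (illustration only).
inline std::array<float, 4> dp_ps_model(const std::array<float, 4>& a,
                                        const std::array<float, 4>& b,
                                        std::uint8_t imm) {
    // High nibble: which lane products contribute to the dot product.
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        if (imm & (0x10 << i))
            sum += a[i] * b[i];

    // Low nibble: which output lanes receive the sum; others stay zero.
    std::array<float, 4> out{};
    for (int i = 0; i < 4; ++i)
        if (imm & (1 << i))
            out[i] = sum;
    return out;
}
```

With this model, 0xFF gives the four-component squared length in every lane, 0x7F gives the three-component (x, y, z) squared length in every lane, and 0xCC sums only the upper two lanes and leaves lanes 0 and 1 at zero, which would feed a zero into the square root for those lanes.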
Also, for people who want to try to hack normalizations to be faster, this is a very interesting read: Chris Lomont - Fast Inverse Square Root.

Also, for people who want to try to hack normalizations to be faster, this is a very interesting read: Chris Lomont - Fast Inverse Square Root.

As others have already mentioned in this thread, that algorithm is no longer practical on a modern architecture. But it remains a mathematical curiosity nonetheless.
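For the curious, the bit trick under discussion looks roughly like the sketch below. It uses the widely circulated magic constant 0x5f3759df with one Newton-Raphson step; the function names are my own, and it assumes 32-bit IEEE 754 floats and a positive, finite input. It is shown to illustrate the idea, not as a recommendation over sqrtps or rsqrtps.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Bit-trick approximation of 1/sqrt(x) for positive, finite x.
inline float fast_rsqrt(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits); // reinterpret the float's bits safely
    bits = 0x5f3759dfu - (bits >> 1);    // initial guess from the magic constant
    float y;
    std::memcpy(&y, &bits, sizeof y);
    return y * (1.5f - 0.5f * x * y * y); // one Newton-Raphson refinement step
}

// Hypothetical helper: relative error against the library result.
inline float relative_error(float x) {
    const float exact = 1.0f / std::sqrt(x);
    return std::fabs(fast_rsqrt(x) - exact) / exact;
}
```

After the single refinement step the result is typically good to a few parts in a thousand, which is why it was attractive when a real square root and divide were expensive.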

Also, for people who want to try to hack normalizations to be faster, this is a very interesting read: Chris Lomont - Fast Inverse Square Root.


Maybe a better read: http://assemblyrequi...ng-square-root/ and http://assemblyrequired.crashworks.org/2009/10/20/square-roots-in-vivo-normalizing-vectors/

This topic is closed to new replies.
