# Normalization Approximation in 30 operations

## Recommended Posts

I think I finally figured out a version that works... it is this

[code]
squaredvector = position.x * position.x + position.y * position.y + position.z * position.z
onediv = 1 / squaredvector
position.x *= signx
position.y *= signy
position.z *= signz
xper = position.x * onediv
yper = position.y * onediv
zper = position.z * onediv
posyz = position.y + position.z
posxz = position.x + position.z
posxy = position.x + position.y
xmul = 1.0 - (posyz * onediv)
ymul = 1.0 - (posxz * onediv)
zmul = 1.0 - (posxy * onediv)
normal.x = xper * (xmul * posyz) * signx
normal.y = yper * (ymul * posxz) * signy
normal.z = zper * (zmul * posxy) * signz
[/code]

Think this is the final version.

##### Share on other sites
Compare this to the code
[code]
float x,y,z;
float r = 1.0f / sqrt(x*x + y*y + z*z);
x *= r;
y *= r;
z *= r;
[/code]
which has one sqrt, div, six muls and two adds. See [url="http://clb.demon.fi/MathGeoLib/docs/float3_Normalize.php"]here[/url] for an example.

Typically sqrt and divisions dominate in a code snippet like that. To see whether your version is a good approximation, I recommend you profile these two versions for runtime performance (clock cycles taken to normalize), and by how much the normalization is off in the worst case.
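For reference, a runnable fleshing-out of the canonical snippet above (the Vec3 struct and the zero-length guard are my additions, not part of the original snippet):

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Canonical normalization: one sqrt, one div, six muls, two adds.
Vec3 normalize(Vec3 v) {
    float len2 = v.x * v.x + v.y * v.y + v.z * v.z;
    if (len2 == 0.0f) return v;  // a zero vector has no direction to preserve
    float r = 1.0f / std::sqrt(len2);
    return { v.x * r, v.y * r, v.z * r };
}
```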

##### Share on other sites
The above formula is off by 15.47005383792515290182975610039% so if you multiply the above normals by 0.86602540378443864676372317075293 you should get a completely accurate normal.

##### Share on other sites
This code doesn't need a square root, and there are only 9 operations, because dividing 1.0 by the three summed components only needs to be done once per vector.

1 division, 3 multiplies, and 5 additions.

As far as I've noticed, the most it's off by is about 15%.

##### Share on other sites
Could you please "normalize" the two vectors <1,0,0> and <-1,1,0> for me?

##### Share on other sites
[quote name='clb' timestamp='1326812888' post='4903634']
Typically sqrt and divisions dominate in a code snippet like that.
[/quote]

true for std::sqrt, but not true for _mm_sqrt_ss. Unless you are using an Atom, div and sqrt will take just as long as an add or sub. On an i3/i5/i7, doing 1/sqrt is actually slower (an additional dependency chain) and less accurate than simply dividing through by the length (since rcp + mul is less accurate than div). [As an aside, you could just use rsqrt instead]
Of course, if you actually want to take account of the fact that normalization can cause a division by zero, then mul will probably still make more sense...

With AVX (sandy bridge) it is possible to normalise 8 x Vector3's with nothing more than: 6 muls, 2 adds, and an rsqrt. The OP's method produces nonsense results, and uses 9 adds, 3 mul, 3 div. That's 15 ops vs 9 for something that is at least 8 times slower.

##### Share on other sites
On a GPU I wouldn't touch this code with a 10-foot pole, no matter how happy the OP may be with it as an approximation. Normalization on a GPU is just a dp3, an rsq and a mul - 1 instruction slot each, 3 total, and all the benefits of parallel processing.

##### Share on other sites
*EDIT*

[code]
if (position.x < 0) {
    position.x *= -1
    signx = -1
}
addvector = position.x + position.y + position.z
positionerror = 1.0 - addvector * 0.01
normal.x = (onediv * (position.x + position.x)) * positionerror * signx;
[/code]

That's 20 operations: 1 divide, 13 muls, 5 adds, and 1 subtraction.

##### Share on other sites
[quote name='mhagain' timestamp='1326850292' post='4903853'] On a GPU I wouldn't touch this code with a 10-foot pole, no matter how happy the OP may be with it as an approximation. Normalization on a GPU is just a dp3, an rsq and a mul - 1 instruction slot each, 3 total, and all the benefits of parallel processing. [/quote]

This is meant more for people who do software graphics.

##### Share on other sites
So basically you are trying to invent a faster sqrt approximation? That has already been done. One such method is the Newton-Raphson method.
And people doing software graphics will choose that over any strange code (with strange limitations on max values) any day, any time.
One implementation of the Newton-Raphson method (with a smart choice of initial value) is "Carmack's inverse sqrt": [url="http://en.wikipedia.org/wiki/Fast_inverse_square_root#Overview_of_the_code"]http://en.wikipedia....iew_of_the_code[/url]
Anyway - for a long time now, all of these methods have been slower than a single assembler instruction - rsqrtss from the SSE instruction set. And that's from the year 1999! Welcome to the future!

Here are some performance numbers on FPU sqrt vs Newton method vs Carmack's invsqrt: http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/

##### Share on other sites
I knew about those. I'll test my code against those and see what happens. I'm guessing the SSE will go faster.

Although if new things are never tried, nothing is ever advanced.

##### Share on other sites
[quote name='Butabee' timestamp='1326861409' post='4903886']
Although if new things are never tried, nothing is ever advanced.
[/quote]

Trying arbitrary permutations of things without a decent mathematical foundation is not only unproductive, but counter-productive as you cannot really reason about the behavior and properties of the "approximation" you have produced.

This is reflected in your confused statements (some now edited away) about simply applying scaling factors, and in your misuse of terms.

Things like the Carmack sqrt approximation have a foundation in mathematics and architecture (Newton-Raphson iterations and the IEEE 754/integer interaction), while your approach just substitutes a function with something that has completely different properties and proclaims it to be The Next Best Thing without a proper analysis.

Heck, a simple mesh plot of your function would have revealed the massive oscillations and regions of infinite results.

##### Share on other sites
I agree with Zao.
Also, your last version even has an if in it... A very bad idea.
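For what it's worth, the sign handling doesn't need a branch at all; a minimal sketch using std::copysign (my own suggestion, not from the thread):

```cpp
#include <cmath>

// Branchless sign transfer: copies the sign bit of `source` onto `magnitude`
// without a compare-and-branch.
float apply_sign(float magnitude, float source) {
    return std::copysign(magnitude, source);
}
```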

##### Share on other sites

There is nothing about your operation that is "normalization". You do not preserve the direction, nor produce a magnitude remotely approaching unity.

The topic should be named "Mutilation in way too many operations", as the result doesn't have any of the desirable properties that division by the norm provides.

Note that the first line of your "final" version is the squared length of the vector. If you've gone to the length of computing that dot product already, you're just a square root away from a properly working regular normalization, with all the excellent properties it guarantees.

##### Share on other sites
Also, please stop deleting your posts just because you've found something wrong with them. It only serves to harm discussion. I have restored your posts.

##### Share on other sites
This is getting out of hand. I understand your motivation, I love to tweak the maximum speed out of my code, but in this instance your approximation is so different from normalisation it should be called something else. In addition, I counted 21 muls and 1 divide in your latest version. Since the canonical version (as posted by clb) uses 6 muls, a divide and a sqrt, what you're suggesting is that 15 muls and inaccuracy are a preferable alternative to using a sqrt. Unfortunately, due to how normalised vectors are typically used, inaccuracy is a big problem, potentially leading to all sorts of weird rendering artefacts.

It might be slow, but I'm not sure where you got the impression that the sqrt operation is the anti-christ.

##### Share on other sites
So something that produces an approximation of the real normal isn't normalizing... ok. If this were turned into SSE code it could be done in 11 operations.

This could work fine for lighting, generating normals as they are needed, but whatever, it's obvious no one here appreciates what I'm trying to do so I'm done.

##### Share on other sites
Just delete the post since the community seems to think it's useless.

##### Share on other sites
[quote name='Butabee' timestamp='1327120785' post='4904761']
...This could work fine for lighting
[/quote]

Have you taken the time to try it? Actually demoing your method would be a nice thing to do, assuming it works.

##### Share on other sites
[quote name='Butabee' timestamp='1327121080' post='4904762']
Just delete the post since the community seems to think it's useless.
[/quote]
Sorry, that's not how things work here.

##### Share on other sites
[quote name='Butabee' timestamp='1327121080' post='4904762']
Just delete the post since the community seems to think it's useless.
[/quote]
It's kind of important that you realize why they might think that. From a purely mathematical standpoint you should be starting in 2D, not 3D. Ideally any normalization algorithm developed in 2D will extend to 3D, or so I believe.

Also here's a [url="http://ideone.com/lfFnX"]paste of your code[/url] running on 100 random vectors. Your algorithm doesn't work at all.
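A sketch of the kind of randomized check that paste performs, here applied to the standard normalization (the names and the fixed seed are my own choices):

```cpp
#include <cmath>
#include <cstdlib>

struct Vec3 { float x, y, z; };

Vec3 normalize_std(Vec3 v) {
    float r = 1.0f / std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return { v.x * r, v.y * r, v.z * r };
}

// Worst deviation from unit length over n random vectors; a correct
// normalization should keep this within a few float ulps.
float worst_error(int n) {
    std::srand(42);  // fixed seed so the check is reproducible
    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        Vec3 v = { float(std::rand() % 201 - 100),
                   float(std::rand() % 201 - 100),
                   float(std::rand() % 201 - 100) };
        if (v.x == 0.0f && v.y == 0.0f && v.z == 0.0f) continue;
        Vec3 u = normalize_std(v);
        float len = std::sqrt(u.x * u.x + u.y * u.y + u.z * u.z);
        worst = std::fmax(worst, std::fabs(len - 1.0f));
    }
    return worst;
}
```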

##### Share on other sites
What's wrong with SSE?
[CODE]
__declspec(noinline) void normalize(__m128 &col0) { // not inlined for testing purpose
__m128 dot = _mm_mul_ps(col0, col0);
dot = _mm_add_ps(dot, _mm_shuffle_ps(dot, dot, _MM_SHUFFLE(2, 3, 0, 1)));
col0 = _mm_div_ps(col0, _mm_sqrt_ps(_mm_add_ps(dot, _mm_shuffle_ps(dot, dot, _MM_SHUFFLE(0, 0, 3, 3)))));
}
[/CODE]

The original topic is about normalization in 30 operations; according to the disassembly, this is just 12 operations:
[CODE]
__declspec(noinline) void normalize(__m128 &col0) {
__m128 dot = _mm_mul_ps(col0, col0);
012D1340 movaps xmm1,xmmword ptr [eax]
012D1343 movaps xmm2,xmm1
012D1346 mulps xmm2,xmm1
dot = _mm_add_ps(dot, _mm_shuffle_ps(dot, dot, _MM_SHUFFLE(2, 3, 0, 1)));
012D1349 movaps xmm0,xmm2
012D134C shufps xmm0,xmm2,0B1h
col0 = _mm_div_ps(col0, _mm_sqrt_ps(_mm_add_ps(dot, _mm_shuffle_ps(dot, dot, _MM_SHUFFLE(0, 0, 3, 3)))));
012D1353 movaps xmm2,xmm0
012D1356 shufps xmm2,xmm0,0Fh
012D135D sqrtps xmm0,xmm2
012D1360 divps xmm1,xmm0
012D1363 movaps xmmword ptr [eax],xmm1
}
[/CODE]

If we exclude the first and last instructions (which move data between memory and registers), the actual calculation is just 10 operations. Accuracy isn't an issue either.

##### Share on other sites
Quick hack-up in SSE4:

[code]
movaps xmm0,xmmword ptr [Vector]
movaps xmm1,xmm0
dpps xmm0,xmm0,CCh
rsqrtps xmm0,xmm0
mulps xmm1,xmm0
movaps xmmword ptr [Vector],xmm1
[/code]

Not tested, and I'm not 100% sure the immediate value for dpps is right.
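On the immediate: my reading of the Intel manual is that bits 7:4 select which lanes go into the dot product (bit 4 = lane 0) and bits 3:0 select which result lanes receive the sum. Under that reading, CCh would dot only the z and w lanes, so for a 3-component vector the mask should be 7Fh (or FFh with w zeroed). A scalar model of those semantics:

```cpp
// Scalar model of the SSE4.1 dpps immediate byte: imm[4+i] gates a[i]*b[i]
// into the sum, and imm[i] selects which output lanes receive it (others 0).
void dpps_model(const float a[4], const float b[4], unsigned imm, float out[4]) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        if (imm & (1u << (4 + i))) sum += a[i] * b[i];
    for (int i = 0; i < 4; ++i)
        out[i] = (imm & (1u << i)) ? sum : 0.0f;
}
```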

##### Share on other sites
This has to be a troll at this point, right? Please let this be a troll...

##### Share on other sites
[quote name='Tachikoma' timestamp='1327154140' post='4904816']
Not tested, and I'm not 100% sure the immediate value for dpps is right.
[/quote]

Should be right. Didn't know there was a dot product instruction. Sadly my CPU doesn't have SSE4.

[quote name='osmanb' timestamp='1327154153' post='4904817']
This has to be a troll at this point, right? Please let this be a troll...
[/quote]

No, it's not a troll. Just wondering why use the FPU when there are SSE versions available. It's probably even easier; as Tachikoma suggested, it can fit on one line
[CODE]
__m128 normalize(const __m128 vector) {
return _mm_div_ps(vector, _mm_sqrt_ps(_mm_dp_ps(vector, vector, 0xFF))); // on second thought it might be FF, not CC
}
[/CODE]
