Fast Vector Normalization

One of my many side projects is to try to come up with faster routines than the Direct3D Extension (D3DX) API. I have read in several texts that D3DX is indeed fast, but the following SSE code is faster by several clock cycles:
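inline void FastVectorNormalize (D3DXVECTOR4 &vector)
{
	// 'vector' must be 16-byte aligned: movaps faults on unaligned addresses
	__asm
	{
		mov	eax,	vector		; eax = address of the vector
		movaps	xmm0,	[eax]		; xmm0 = (x, y, z, w)
		movaps	xmm2,	xmm0		; keep a copy of the input
		mulps	xmm0,	xmm0		; square each component
		movaps	xmm1,	xmm0
		shufps	xmm0,	xmm0,	_MM_SHUFFLE (2, 1, 0, 3)	; rotate by one lane
		addps	xmm1,	xmm0		; pairwise sums
		movaps	xmm0,	xmm1
		shufps	xmm1,	xmm1,	_MM_SHUFFLE (1, 0, 3, 2)	; swap halves
		addps	xmm0,	xmm1		; every lane = x*x + y*y + z*z + w*w
		rsqrtps	xmm0,	xmm0		; approximate 1/sqrt in every lane
		mulps	xmm0,	xmm2		; scale the original vector
		movaps	[eax],	xmm0		; write the result back
	}
}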
Please feel free to use and adapt this code if your program is normalization intensive. Be warned, however: the SSE instruction 'rsqrtps' uses internal processor lookup tables to approximate reciprocal square roots (hence the speed), so the result is only good to about 12 bits (maximum relative error 1.5 x 2^-12), i.e. three to four significant decimal digits.

Thanks for sharing.

There are some things to bear in mind, though, when comparing the above to D3DX:

1) older CPUs such as the K7 version of AMD's Athlon don't support SSE instructions. The D3DX routines have multiple code paths to take advantage of whatever vector extension support, if any, is present on the CPU (3DNow!, SSE, SSE2, or plain x87). A small part of the extra overhead of D3DX compared to an extension specific routine like the above is due to being able to run on any CPU.

2) the SSE code above assumes the input vector is 16-byte aligned (it uses movaps rather than movups), so it's worth enforcing alignment with __declspec(align(16))

3) when comparing performance of hand written assembly against D3DX, static disassembly and static cycle counting are not the way to do it; when first called, D3DX sets up some of its routines based on which CPU extensions are present, so static disassembly won't always take you to the right place.

4) good profiling of small routines is difficult. When profiling to compare performance, make sure a) each is repeated enough to get a sensible average out of the spikes present in a multitasking, multithreaded OS (say, 10000 calls to each); b) each is given a chance to warm up (initialisation, see 3) before timing starts (say, 100 calls to each); c) each is tested separately (it's no good if one incurs cache misses and the other doesn't).

5) you can put all the inline, __inline, __forceinline, and __declspec(naked)'s you like around the function - if that code gets inlined, the MSVC optimizer will be very wary of all variables and registers you use; in particular, 'vector' is very likely to be assumed to be capable of pointer aliasing (reference or not, the asm code still uses it as a pointer!).

6) for inlined code, SSE intrinsics will play more nicely with the compiler's optimiser than plain ASM.

7) of course, the fastest code in the world is code that doesn't need to be run at all. If normalization is a bottleneck in your code, I'd suggest there's a higher level problem that needs addressing first. Many people do have a tendency to over-normalize. It can be handy to remember that for a unit vector x² + y² + z² = 1. [wink]

8) a hand tuned ASM/intrinsic routine here and another there isn't going to get you big performance wins compared to another hand tuned ASM/intrinsic routine unless it can do a sizable chunk of work. Normalizing a single vector isn't a sizable chunk of work; transforming and normalizing 1000 vectors is.

Thx for your input S1CA.

Yes, I forgot to mention the 16-byte alignment rule when addressing memory in SSE. This is a quick re-hash of part of my code, where I'm using a 16-byte aligned struct to represent my 4-vector. Rather than submit the entire struct, I wanted to keep things simple.
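
The struct itself isn't shown in this thread, so the following is only a guess at its general shape: a minimal aligned 4-vector plus a debug check for pointers that come from elsewhere (heap allocations, casts). `AlignedVector4` and `IsAligned16` are hypothetical names, and `alignas(16)` is used as the standard C++11 equivalent of MSVC's `__declspec(align(16))`.

```cpp
#include <cstdint>

// Hypothetical stand-in for a 16-byte aligned 4-vector struct.
struct alignas(16) AlignedVector4
{
    float x, y, z, w;
};

// Compile-time checks: movaps needs 16-byte alignment, and the four
// floats should be packed with no padding.
static_assert(alignof(AlignedVector4) == 16, "movaps needs 16-byte alignment");
static_assert(sizeof(AlignedVector4) == 16, "four packed floats, no padding");

// Run-time check, useful in debug builds before handing a pointer to
// code that uses aligned loads and stores.
inline bool IsAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}
```

The attribute covers stack and static objects; memory from older allocators may still need an aligned allocation function, which is where a check like `IsAligned16` earns its keep.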

Anyway, I assume if people are already familiar with SSE, then they'll be aware of SSE semantics regarding alignment, processor versions and optimization tricks like inlining etc etc.

I generally tend to test time-critical code using a macro I wrote several years ago, based on benchmarking code in the MSDN technical article:

Developing Optimized Code with Microsoft Visual C++ 6.0


If you have not already done so, give it a good read. I've found it quite helpful on optimization issues under MSVC++ 6.0.

I am aware of the D3DX startup protocols. However, as I tend to time critical code over BILLIONS of iterations (and besides, I call D3DX routines BEFORE the code-timing sections), such protocols do not skew my results.

I'm pretty confident that my timing techniques are as accurate as I can make them. I've been using them for long enough, and I only test SMALL sections of code (i.e. a single function or two) to avoid cache anomalies. And it appears my SSE code beats D3DX vector normalization by a mere 7 nanoseconds per call (on my 3.0 GHz Pentium machine). Not much perhaps, but it may be of some significance in a normalization intensive application.
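
The timing macro isn't shown in this thread. As a rough stand-in, the warm-up-then-average approach discussed above can be sketched with `std::chrono`; `AverageNanoseconds` is a hypothetical helper, not the MSDN code, and it measures wall-clock time including loop overhead.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: average wall-clock nanoseconds per call of `fn`.
// `fn` should have an observable side effect, otherwise the optimizer
// may remove the work being measured.
template <typename Fn>
double AverageNanoseconds(Fn&& fn, std::size_t warmup, std::size_t iterations)
{
    // Untimed warm-up calls: lets one-off initialisation and cache
    // fills happen before the clock starts.
    for (std::size_t i = 0; i < warmup; ++i) fn();

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) fn();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Each routine under test would be timed in its own run with the same warm-up and iteration counts. A difference of a few nanoseconds per call is close to timer and loop overhead, so repeat the measurement several times before trusting it.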

Anyway, my code is merely a little optimization trick (of which there are PLENTY in numerous books I've read) and is in no way critical to 99.999% of 3D applications. It's just nice to get your hands dirty in a little ASM from time to time :-)
