The typical representation of a 3D vector is with 3 floating point numbers. If you use 32 bit floats this makes each vector 12 bytes. A 12 byte vector provides a huge amount of precision and versatility. It can point in any direction with very high accuracy and it can express vectors with a huge range of lengths also with very high accuracy. For unit length vectors you don't usually need this kind of high precision and you can save lots of memory using quantized normals.

This page describes my method for mapping 12 byte vectors into 2 byte quantized normals and back. The transformation in each direction can be achieved in constant time with just a few table lookups.

First though have a look at what you lose in accuracy. In these two images I have a sphere represented with about 40000 triangles. In the first picture the normals used to light the sphere are quantized 2 byte normals. In the second picture I am using complete 12 byte normals. ( click on the image to enlarge it )

Notice how the second picture is perfect sphericity itself. But the first picture is not too bad. There are two subtle creases along the equator where my quantization method is weakest, and the surface has a very slight wobble to it, unlike the second picture. Now remember that these pictures represent the worst case for using normals to calculate lighting. The sphere is lit obliquely and the surface is a nice smooth continuous curve. In the typical case of a 3d model, or a terrain, this quantization error is absolutely invisible. Go to my terrain rendering page to see what I mean. All of those scenes use quantized normals.

I have included the code at the bottom. Here is a quick explanation of how it works. If you know that your vector is unit length then you know that:

[font="Courier New"][color="#000080"][bquote]X[sup]2[/sup] + Y[sup]2[/sup] + Z[sup]2[/sup] = 1[/bquote][/color][/font]

Therefore you know that if you have 2 numbers you can compute the magnitude of the third number, but not its sign. So you need to store off the sign of each element. Given that I am going to pack this whole thing into 16 bits, I use 3 bits for the sign of each vector component leaving me 13 bits to represent the magnitude of two of the components.You also know that the magnitude of any component ranges from 0 to 1 so it makes quantization practical.

It would be simple to just code one value into 6 bits and another value into the remaining 7 bits and just use those two numbers to look up the third value in a table. However it is possible to express both numbers using seven bits each but using a total of only 13 bits for both. This is because when you select two components of a unit vector, not every pair of numbers less than 1 is possible. For example if X == 1.0 then Y and Z must both be 0. As X becomes smaller more values become possible for Y and Z. If you make your quantization table a 2d table indexed on the square of X and the square of Y, then the cells of the table that can be filled with valid values for Z form a triangle. You can see that we are using only half of the space. Which is great because I am short one bit!

So say that I will represent X with 7 bits, and Y with 6 bits and look up Z in my table. To get a 7th bit of resolution for Y I just check to see if my quantized Y would be bigger than the dynamic range of 6 bits. In that case I subtract both numbers from 127. Now the X still fits in the X slot and the Y is guaranteed to fit in the Y slot and the pair of numbers will index into a part of the table that is still not coded with values for Z.

There are better ways to quantize unit vectors. I have seen one really good way in a recent SIGGRAPH paper. It takes advantage of the stuff I describe here but also takes advantage of the symmetry within a single quadrant to get even more accuracy within 16 bits. The problem is that mapping from 12 byte vectors to 2 byte vectors involves doing some transcendentals. I believe going the other way from 2 to 12 is just as fast as my method. So you might want to look for that paper. I'll put a reference to it here when I find it.

Here is the code that implements my quantization method. The c3dVector object is just of structure with 3 floats in it. Your typical 12 byte vector.

`#ifndef _3D_UNITVEC_H`

#define _3D_UNITVEC_H

#include

#include

#include "3dmath/vector.h"

#define UNITVEC_DECLARE_STATICS \

float cUnitVector::mXYComponents[128]; \

float cUnitVector::mZComponents[8192]; \

c3dVector cUnitVector::mTmpVec;

// upper 3 bits

#define SIGN_MASK 0xe000

#define XSIGN_MASK 0x8000

#define YSIGN_MASK 0x4000

#define ZSIGN_MASK 0x2000

// middle 6 bits - xbits

#define TOP_MASK 0x1f80

// lower 7 bits - ybits

#define BOTTOM_MASK 0x007f

// NOTE: this quantization actually looks pretty good but it seems

// that it will tend to cluster around z == 1 | z == -1. It might be

// better if instead of quantizing x^2 and y^2 we quantize (1-x)^2 and (1-y)^2.

// it might distribute the vectors better.

// make a macro to encode and unencode vector components and test these separately

#define UV_PACK( V ) (int)((( V * V ) * 126.0f ) + 0.5f )

//v.x = 1.0f - v.x; v.x *= v.x; v.x = 1.0f - v.x;

// v.y = 1.0f - v.y; v.y *= v.y; v.y = 1.0f - v.y;

// int xbits = (int)((( v.x ) * 126.0f ) + 0.45f );

// int ybits = (int)((( v.y ) * 126.0f ) + 0.45f );

#define UV_UNPACK( X ) (( X > 126 ) ? 1.0f : (float)(sqrt((float)X / 126.0f )));

// comX = 1.0f - comX;

// comX = (float)sqrt( comX );

// comX = 1.0f - comX;

// mXYComponents[x] = comX;

// a compressed unit vector. reasonable fidelty for unit vectors in a 16 bit

// package. Good enough for surface normals we hope.

class cUnitVector : public c3dMathObject

{

public:

cUnitVector() { mVec = 0; }

cUnitVector( const c3dVector& vec )

{

packVector( vec );

}

cUnitVector( unsigned short val ) { mVec = val; }

cUnitVector& operator=( const c3dVector& vec )

{ packVector( vec ); return *this; }

operator c3dVector()

{

unpackVector( mTmpVec );

return mTmpVec;

}

void packVector( const c3dVector& vec )

{

assert( vec.isValid());

c3dVector v = vec;

// convert from c3dVector to cUnitVector

assert( v.length() <= 1.001f );

// handle the sign bits

mVec = 0;

if ( v.x < 0 ) { mVec |= XSIGN_MASK; v.x = -v.x; }

if ( v.y < 0 ) { mVec |= YSIGN_MASK; v.y = -v.y; }

if ( v.z < 0 ) { mVec |= ZSIGN_MASK; v.z = -v.z; }

// quantize the square of the xand y components. we quantize the square

// because the sum of these squares is never bigger than 1

int xbits = UV_PACK( v.x );

int ybits = UV_PACK( v.y );

// we can ignore the z component because we can reconstruct it later

// from the x and y since we require that the input always be a

// unit vector.

if ( xbits >= 64 )

{

ybits = 127 - ybits;

xbits = 127 - xbits;

}

mVec |= ( xbits << 7 );

mVec |= ybits;

}

void unpackVector( c3dVector& vec )

{

// convert from cUnitVector to c3dVector

// look up the z component given the quantize x and y components

vec.z = mZComponents[mVec & ~SIGN_MASK];

short xbits = (( mVec & TOP_MASK ) >> 7 );

short ybits = ( mVec & BOTTOM_MASK );

if (( ybits + xbits ) > 127 )

{

ybits = 127 - ybits;

xbits = 127 - xbits;

}

// this array contains the square root of the squared values used

// in the quantization. so everything works out.

vec.y = mXYComponents[ybits];

vec.x = mXYComponents[xbits];

// set all the sign bits

if ( mVec & XSIGN_MASK ) vec.x = -vec.x;

if ( mVec & YSIGN_MASK ) vec.y = -vec.y;

if ( mVec & ZSIGN_MASK ) vec.z = -vec.z;

assert( vec.isValid());

}

static void initializeStatics()

{

float* zComp = mZComponents;

for ( int x = 0; x < 128; x++ )

{

mXYComponents[x] = UV_UNPACK( x );

assert( _finite( mXYComponents[x] ));

for ( int y = 0; y < 128; y++ )

{

if (( x + y ) < 128 )

{

float comX = (float)x / 126.0f;

if ( comX > 1.0f ) comX = 1.0f;

float comY = (float)y / 126.0f;

if ( comY > 1.0f ) comY = 1.0f;

assert( _finite( comY ));

int xbits = x;

int ybits = y;

if ( xbits >= 64 )

{

ybits = 127 - ybits;

xbits = 127 - xbits;

}

int index = ( xbits * 128 ) + ybits;

zComp[index] = (float)sqrt( fabs( 1.0f - comX - comY ));

assert( _finite( zComp[index] ));

}

}

}

}

void test()

{

#define TEST_RANGE 4

#define TEST_RANDOM 100

#define TEST_ANGERROR 1.0

float maxError = 0;

float avgError = 0;

int numVecs = 0;

{for ( int x = -TEST_RANGE; x < TEST_RANGE; x++ )

{

for ( int y = -TEST_RANGE; y < TEST_RANGE; y++ )

{

for ( int z = -TEST_RANGE; z < TEST_RANGE; z++ )

{

if (( x + y + z ) == 0 ) continue;

c3dVector vec( (float)x, (float)y, (float)z );

c3dVector vec2;

vec.normalize();

packVector( vec );

unpackVector( vec2 );

float ang = vec.dot( vec2 );

ang = (( fabs( ang ) > 0.999f ) ? 0 : (float)acos(ang));

if (( ang > TEST_ANGERROR ) | ( !_finite( ang )))

{

cerr << "error: " << ang << endl;

cerr << "orig vec: " << vec.x << ",\t" << vec.y << ",\t"

<< vec.z << "\tmVec: " << mVec << endl;

cerr << "quantized vec2: " << vec2.x << ",\t" << vec2.y << ",\t"

<< vec2.z << endl << endl;

}

avgError += ang;

numVecs++;

if ( maxError < ang ) maxError = ang;

}

}

}}

for ( int w = 0; w < TEST_RANDOM; w++ )

{

c3dVector vec( genRandom(), genRandom(), genRandom());

c3dVector vec2;

vec.normalize();

packVector( vec );

unpackVector( vec2 );

float ang =vec.dot( vec2 );

ang = (( ang > 0.999f ) ? 0 : (float)acos(ang));

if (( ang > TEST_ANGERROR ) | ( !_finite( ang )))

{

cerr << "error: " << ang << endl;

cerr << "orig vec: " << vec.x << ",\t" << vec.y << ",\t"

<< vec.z << "\tmVec: " << mVec << endl;

cerr << "quantized vec2: " << vec2.x << ",\t" << vec2.y << ",\t"

<< vec2.z << endl << endl;

}

avgError += ang;

numVecs++;

if ( maxError < ang ) maxError = ang;

}

{ for ( int x = 0; x < 50; x++ )

{

c3dVector vec( (float)x, 25.0f, 0.0f );

c3dVector vec2;

vec.normalize();

packVector( vec );

unpackVector( vec2 );

float ang = vec.dot( vec2 );

ang = (( fabs( ang ) > 0.999f ) ? 0 : (float)acos(ang));

if (( ang > TEST_ANGERROR ) | ( !_finite( ang )))

{

cerr << "error: " << ang << endl;

cerr << "orig vec: " << vec.x << ",\t" << vec.y << ",\t"

<< vec.z << "\tmVec: " << mVec << endl;

cerr << " quantized vec2: " << vec2.x << ",\t" << vec2.y << ",\t"

<< vec2.z << endl << endl;

}

avgError += ang;

numVecs++;

if ( maxError < ang ) maxError = ang;

}}

cerr << "max angle error: " << maxError

<< ", average error: " << avgError / numVecs

<< ", num tested vecs: " << numVecs << endl;

}

friend ostream& operator<< ( ostream& os, const cUnitVector& vec )

{ os << vec.mVec; return os; }

protected:

unsigned short mVec;

static float mXYComponents[128];

static float mZComponents[8192];

static c3dVector mTmpVec;

};

#endif // _3D_VECTOR_H