optimize smooth code

Started by
12 comments, last by Basiror 18 years, 8 months ago
How would you optimize this code? 1. replacing the '%' operations with simple if else conditions? currently i use the modulo to smooth the transitions 2.maybe one could unroll the loop completely? do compilers unroll loops with 32*32 iterations? or would this blow the code up too much?

//template function for smoothing
	template<class T> void smoothmap(T &x)
	{
		uint32 i = 0;
		uint32 j = 0;

		for(i=1;i<=x.m_uiSize;i++)
		{
			for(j=1;j<=x.m_uiSize;j++)
			{
				x.m_Map[i%x.m_uiSize][j%x.m_uiSize] = 
					//center
					x.m_Map[j]*0.25f 
					//diagonal corners
					+ x.m_Map[(i-1)%x.m_uiSize][(j-1)%x.m_uiSize]*0.0625f
					+ x.m_Map[(i-1)%x.m_uiSize][(j+1)%x.m_uiSize]*0.0625f
					+ x.m_Map[(i+1)%x.m_uiSize][(j-1)%x.m_uiSize]*0.0625f
					+ x.m_Map[(i+1)%x.m_uiSize][(j+1)%x.m_uiSize]*0.0625f
					//horizontal and vertical corners
					+ x.m_Map[(i-1)%x.m_uiSize][(j)%x.m_uiSize]*0.125f
					+ x.m_Map[(i+1)%x.m_uiSize][(j)%x.m_uiSize]*0.125f
					+ x.m_Map[(i)%x.m_uiSize][(j-1)%x.m_uiSize]*0.125f
					+ x.m_Map[(i)%x.m_uiSize][(j+1)%x.m_uiSize]*0.125f
					//end of instruction
					;
			}
		}
	}
2. is there a way to define count as a class specific constant? so i don t have to store it in a variable? that way i could pass it to the templated functions and give the compiler the possibility to unroll everything

template<class T, int count>
class someclass
{
private:
T m_Map[count][count];
}


http://www.8ung.at/basiror/theironcross.html
Advertisement
ah I think I found the solution

i just add a enum to the class definition like this one

//template quadratic map class	template<class T, uint32 uiSize> class map	{	public:		//typedefs and enums		enum {MAP_SIZE = uiSize};		typedef	T	type_name;	public:		//member functions constructors & destructors		//fills the map with zero		map():m_uiSize(uiSize)		{			T tFill = 0;			std::fill<T*,T>(&m_Map[0][0],&m_Map[uiSize-1][uiSize-1],tFill);		}	public:		//friend function declarations to access private members		template<class U> friend void fillmap(U &x);		template<class U> friend void smoothmap(U &x);	private:		uint32	m_uiSize;		T	m_Map[uiSize][uiSize];	};


that should do the job
http://www.8ung.at/basiror/theironcross.html
It would probably be faster to do something like this: (not sure if the compiler is this smart?)

template<class T> void smoothmap(T &x){	uint32 i;	uint32 j;	uint32 ti;	uint32 tj;	uint32 t1, t2;	uint32 im,ip,jm,jp;	for(i=1;i<=x.m_uiSize;i++)	{		ti = i%x.m_uiSize;		im = (i-1)%x.m_uiSize;		ip =(i+1)%x.m_uiSize;		for(j=1;j<=x.m_uiSize;j++)		{			tj = j%x.m_uiSize;			jm=(j-1)%x.m_uiSize;			jp=(j+1)%x.m_uiSize;			x.m_Map[ti][tj]>>=2;			//center point			//diagonal corners			t1 = (x.m_Map[im][jm]				+ x.m_Map[im][jp]				+ x.m_Map[ip][jm]				+ x.m_Map[ip][jp])>>4;			//horizontal and vertical corners			t2 = (x.m_Map[im][tj]				+ x.m_Map[ip][tj]				+ x.m_Map[ti][jm]				+ x.m_Map[ti][jp])>>3;			x.m_Map[ti][tj]+= t1+t2;		}	}}


This removes a bunch of modules operations, and removes a bunch of multiplication operations as well. Of course there are more optimizations, but this should be a good start. By shifting by 4, it is the same as dividing by 16 (which is what you were trying to do with your mult), and 3 is equivalant to dividing by 8 (as far as integer math is concerned!). If you're using floats, you must replace >>4 with *0.0625f and >>3 with *0.125f, and >>=2 with *=0.25f but that still saves 6 mults per iteration ;). Why do you use [i%x.m_uiSize] and another spot use ? You realize that if you're going from 1 -> uiSize, it will be reading/writing an invalid memory location right? My solution fixes this problem by using modulus for both of them, without actually requiring any extra operations (just extra variables). Each of the modulus for the i loop only has to be executed uiSize times rather than uiSize*uiSize, and it removes the # of times that you repeat the exact same operation ;).
Replacing division of powers of 2 by shifts just makes the code harder to read. The compiler is indeed smart enough to apply decades-old point optimizations - assuming optimizations are turning on at all of course.

You can always look at the compiler assembly output to verify this sort of thing.
-Mike
Quote:Original post by Anon Mike
Replacing division of powers of 2 by shifts just makes the code harder to read. The compiler is indeed smart enough to apply decades-old point optimizations - assuming optimizations are turning on at all of course.


Besides, shifts may not actually be faster. The Pentium 4, for example, doesn't do the shift in hardware, but instead uses an integer divide, iirc.

[Edited by - Promit on August 1, 2005 7:27:06 PM]
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
Quote:Original post by Promit
Quote:Original post by Anon Mike
Replacing division of powers of 2 by shifts just makes the code harder to read. The compiler is indeed smart enough to apply decades-old point optimizations - assuming optimizations are turning on at all of course.


Besides, shifts may not actually be faster. The Pentium 4, for example, doesn't do the shift in hardware, but instead uses an integer divide, iirc.


erm, luckily I have only bought AMDs for arond 5 years now *G*

@ready4dis: i just missed the % in this case but I thnk I will modify it to what you suggested
http://www.8ung.at/basiror/theironcross.html
so i have applied the changes

//template function for smoothing	template<class T> void smoothmap(T &x)	{		uint32 i;		uint32 j;		uint32 ti1,ti2,ti3;		uint32 tj1,tj2,tj3;		for(i=1;i<=T::MAP_SIZE;i++)		{			//do modulos only once			ti1 = i%T::MAP_SIZE;			ti2 = (i+1)%T::MAP_SIZE;			ti3 = (i-1)%T::MAP_SIZE;			for(j=1;j<=T::MAP_SIZE;j++)			{				tj1 = j%T::MAP_SIZE;				tj2 = (j+1)%T::MAP_SIZE;				tj3 = (j-1)%T::MAP_SIZE;								x.m_Map[ti1][tj1] = 					//center					x.m_Map[ti1][tj2]*0.25f 					//diagonal corners					+ (x.m_Map[ti3][tj3]					+ x.m_Map[ti3][tj2]					+ x.m_Map[ti2][tj3]					+ x.m_Map[ti2][tj2])*0.0625f					//horizontal and vertical corners					+ (x.m_Map[ti3][tj1]					+ x.m_Map[ti2][tj1]					+ x.m_Map[ti1][tj3]					+ x.m_Map[ti1][tj2])*0.125f;					//end of instruction					;			}		}	}


as for the floating point multiplications, the m_Map values should be floats in general and i guess if i would use integers the compiler would optimize it to shifts on his own
http://www.8ung.at/basiror/theironcross.html
You don't want the compiler to unroll that, it will probably only worsen performance. There are more higher level optimisations left which would actually make a difference.

These moduluses are redundant, so you may as well remove them:
			ti3 = (i-1)%T::MAP_SIZE;				tj3 = (j-1)%T::MAP_SIZE;
Actually, given that it shouldn't matter in what order the smoothing function is applied to the grid elements, there's a cunning way of doing it without any modulus at all (or even an if!)
for (ti1=0, ti2=T::MAP_SIZE-1, ti3=1; ti2 >= 0; ti3=ti1, ti1=ti2, --ti2)    for (tj1=0, tj2=T::MAP_SIZE-1, tj3=1; tj2 >= 0; tj3=tj1, tj1=tj2, --tj2)
This even eliminates i and j too.[cool]

Anyway, by now I assume it's fast enough, if not, the memory access is probably the bottleneck at this stage.[grin]
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
yeah what i could think of would be reorder the memory access to reduce memory jumping if that is still an issue at all with all those optimizations

i assume the compiler prefetches the array on its own so i don t have to add a prefetch instruction right?
http://www.8ung.at/basiror/theironcross.html
I run in a little problem now
at the top you see the typedef with 256 which created a cross:map with an array
T[256][256]

so far so good everything works ok as long as i don t insert a nummer > 256
if i insert 512 the application instantly crashes without printing the std:cerr at all

so something must be screwed i use the VC++ toolkit 2003
any idea on what s going wrong?

this piece of code instanciates the map
typedef cross::map<float,256> map32;int main(int argc,char *argv[]){	std::cerr<<"short: "<<map32::MAP_SIZE<<std::endl;		map32 maps;	cross::fillmap<map32>(maps);	cross::smoothmap<map32>(maps);	cross::smoothmap<map32>(maps);	cross::smoothmap<map32>(maps);	maps.dump8(std::string("test.raw"));	return 0;}



2. class declaration and implementation of fillmap +dump8
//template quadratic map class	template<class T, uint32 uiSize> class map	{	public:		//typedefs and enums		//type specific definition of size to allow unrolling of loops		enum {MAP_SIZE = uiSize};		typedef	T	type_name;	public:		//member functions constructors & destructors		//fills the map with zero		map():m_uiSize(uiSize)		{			T tFill = 0;			std::fill<T*,T>(&m_Map[0][0],&m_Map[uiSize-1][uiSize-1],tFill);		};		//dumps a 8bit heightmap		void dump8(std::string strFilename)		{			uint32	i;			uint32	j;			std::ofstream file;			file.open(strFilename.c_str());			for(i=0;i<uiSize;i++)			{				for(j=0;j<uiSize;j++)				{					file << static_cast<uchar8>((m_Map[j] + 1.0f)*127);				}			}			file.close();		};	public:		//friend function declarations to access private members		template<class U> friend void fillmap(U &x);		template<class U> friend void smoothmap(U &x);	private:		uint32	m_uiSize;		T	m_Map[MAP_SIZE][uiSize];	};//template function for noise generation	template<class T> void	fillmap(T &x)	{		uint32 i;		uint32 j;				//loop through the rows		for(i=0;i<T::MAP_SIZE;i++)		{			//loop through the colums			for(j=0;j<T::MAP_SIZE;j++)			{				//set a new random seed everytime				srand(static_cast<int>(time(NULL)));				//retrieve a noise value with a new random pattern				x.m_Map[j] = noise(i,j,rand());			}		}	};




Just found the problem it caused a stack overflow i moved the instance of map32 maps out of the main function and it worked fine

i didn t know that the stack is so tiny
the map32 class is only 1 mb large
http://www.8ung.at/basiror/theironcross.html

This topic is closed to new replies.

Advertisement