# optimize smooth code

## Recommended Posts

How would you optimize this code? 1. replacing the '%' operations with simple if else conditions? currently i use the modulo to smooth the transitions 2.maybe one could unroll the loop completely? do compilers unroll loops with 32*32 iterations? or would this blow the code up too much?
//template function for smoothing
template<class T> void smoothmap(T &x)
{
uint32 i = 0;
uint32 j = 0;

for(i=1;i<=x.m_uiSize;i++)
{
for(j=1;j<=x.m_uiSize;j++)
{
x.m_Map[i%x.m_uiSize][j%x.m_uiSize] =
//center
x.m_Map[i][j]*0.25f
//diagonal corners
+ x.m_Map[(i-1)%x.m_uiSize][(j-1)%x.m_uiSize]*0.0625f
+ x.m_Map[(i-1)%x.m_uiSize][(j+1)%x.m_uiSize]*0.0625f
+ x.m_Map[(i+1)%x.m_uiSize][(j-1)%x.m_uiSize]*0.0625f
+ x.m_Map[(i+1)%x.m_uiSize][(j+1)%x.m_uiSize]*0.0625f
//horizontal and vertical corners
+ x.m_Map[(i-1)%x.m_uiSize][(j)%x.m_uiSize]*0.125f
+ x.m_Map[(i+1)%x.m_uiSize][(j)%x.m_uiSize]*0.125f
+ x.m_Map[(i)%x.m_uiSize][(j-1)%x.m_uiSize]*0.125f
+ x.m_Map[(i)%x.m_uiSize][(j+1)%x.m_uiSize]*0.125f
//end of instruction
;
}
}
}

2. is there a way to define count as a class specific constant? so i don t have to store it in a variable? that way i could pass it to the templated functions and give the compiler the possibility to unroll everything
template<class T, int count>
class someclass
{
private:
T m_Map[count][count];
}



##### Share on other sites
ah I think I found the solution

i just add a enum to the class definition like this one

//template quadratic map class	template<class T, uint32 uiSize> class map	{	public:		//typedefs and enums		enum {MAP_SIZE = uiSize};		typedef	T	type_name;	public:		//member functions constructors & destructors		//fills the map with zero		map():m_uiSize(uiSize)		{			T tFill = 0;			std::fill<T*,T>(&m_Map[0][0],&m_Map[uiSize-1][uiSize-1],tFill);		}	public:		//friend function declarations to access private members		template<class U> friend void fillmap(U &x);		template<class U> friend void smoothmap(U &x);	private:		uint32	m_uiSize;		T	m_Map[uiSize][uiSize];	};

that should do the job

##### Share on other sites
It would probably be faster to do something like this: (not sure if the compiler is this smart?)

template<class T> void smoothmap(T &x){	uint32 i;	uint32 j;	uint32 ti;	uint32 tj;	uint32 t1, t2;	uint32 im,ip,jm,jp;	for(i=1;i<=x.m_uiSize;i++)	{		ti = i%x.m_uiSize;		im = (i-1)%x.m_uiSize;		ip =(i+1)%x.m_uiSize;		for(j=1;j<=x.m_uiSize;j++)		{			tj = j%x.m_uiSize;			jm=(j-1)%x.m_uiSize;			jp=(j+1)%x.m_uiSize;			x.m_Map[ti][tj]>>=2;			//center point			//diagonal corners			t1 = (x.m_Map[im][jm]				+ x.m_Map[im][jp]				+ x.m_Map[ip][jm]				+ x.m_Map[ip][jp])>>4;			//horizontal and vertical corners			t2 = (x.m_Map[im][tj]				+ x.m_Map[ip][tj]				+ x.m_Map[ti][jm]				+ x.m_Map[ti][jp])>>3;			x.m_Map[ti][tj]+= t1+t2;		}	}}

This removes a bunch of modules operations, and removes a bunch of multiplication operations as well. Of course there are more optimizations, but this should be a good start. By shifting by 4, it is the same as dividing by 16 (which is what you were trying to do with your mult), and 3 is equivalant to dividing by 8 (as far as integer math is concerned!). If you're using floats, you must replace >>4 with *0.0625f and >>3 with *0.125f, and >>=2 with *=0.25f but that still saves 6 mults per iteration ;). Why do you use [i%x.m_uiSize] and another spot use [i]? You realize that if you're going from 1 -> uiSize, it will be reading/writing an invalid memory location right? My solution fixes this problem by using modulus for both of them, without actually requiring any extra operations (just extra variables). Each of the modulus for the i loop only has to be executed uiSize times rather than uiSize*uiSize, and it removes the # of times that you repeat the exact same operation ;).

##### Share on other sites
Replacing division of powers of 2 by shifts just makes the code harder to read. The compiler is indeed smart enough to apply decades-old point optimizations - assuming optimizations are turning on at all of course.

You can always look at the compiler assembly output to verify this sort of thing.

##### Share on other sites
Quote:
 Original post by Anon MikeReplacing division of powers of 2 by shifts just makes the code harder to read. The compiler is indeed smart enough to apply decades-old point optimizations - assuming optimizations are turning on at all of course.

Besides, shifts may not actually be faster. The Pentium 4, for example, doesn't do the shift in hardware, but instead uses an integer divide, iirc.

[Edited by - Promit on August 1, 2005 7:27:06 PM]

##### Share on other sites
Quote:
Original post by Promit
Quote:
 Original post by Anon MikeReplacing division of powers of 2 by shifts just makes the code harder to read. The compiler is indeed smart enough to apply decades-old point optimizations - assuming optimizations are turning on at all of course.

Besides, shifts may not actually be faster. The Pentium 4, for example, doesn't do the shift in hardware, but instead uses an integer divide, iirc.

erm, luckily I have only bought AMDs for arond 5 years now *G*

@ready4dis: i just missed the % in this case but I thnk I will modify it to what you suggested

##### Share on other sites
so i have applied the changes

//template function for smoothing	template<class T> void smoothmap(T &x)	{		uint32 i;		uint32 j;		uint32 ti1,ti2,ti3;		uint32 tj1,tj2,tj3;		for(i=1;i<=T::MAP_SIZE;i++)		{			//do modulos only once			ti1 = i%T::MAP_SIZE;			ti2 = (i+1)%T::MAP_SIZE;			ti3 = (i-1)%T::MAP_SIZE;			for(j=1;j<=T::MAP_SIZE;j++)			{				tj1 = j%T::MAP_SIZE;				tj2 = (j+1)%T::MAP_SIZE;				tj3 = (j-1)%T::MAP_SIZE;								x.m_Map[ti1][tj1] = 					//center					x.m_Map[ti1][tj2]*0.25f 					//diagonal corners					+ (x.m_Map[ti3][tj3]					+ x.m_Map[ti3][tj2]					+ x.m_Map[ti2][tj3]					+ x.m_Map[ti2][tj2])*0.0625f					//horizontal and vertical corners					+ (x.m_Map[ti3][tj1]					+ x.m_Map[ti2][tj1]					+ x.m_Map[ti1][tj3]					+ x.m_Map[ti1][tj2])*0.125f;					//end of instruction					;			}		}	}

as for the floating point multiplications, the m_Map values should be floats in general and i guess if i would use integers the compiler would optimize it to shifts on his own

##### Share on other sites
You don't want the compiler to unroll that, it will probably only worsen performance. There are more higher level optimisations left which would actually make a difference.

These moduluses are redundant, so you may as well remove them:
			ti3 = (i-1)%T::MAP_SIZE;				tj3 = (j-1)%T::MAP_SIZE;
Actually, given that it shouldn't matter in what order the smoothing function is applied to the grid elements, there's a cunning way of doing it without any modulus at all (or even an if!)
for (ti1=0, ti2=T::MAP_SIZE-1, ti3=1; ti2 >= 0; ti3=ti1, ti1=ti2, --ti2)    for (tj1=0, tj2=T::MAP_SIZE-1, tj3=1; tj2 >= 0; tj3=tj1, tj1=tj2, --tj2)
This even eliminates i and j too.[cool]

Anyway, by now I assume it's fast enough, if not, the memory access is probably the bottleneck at this stage.[grin]

##### Share on other sites
yeah what i could think of would be reorder the memory access to reduce memory jumping if that is still an issue at all with all those optimizations

i assume the compiler prefetches the array on its own so i don t have to add a prefetch instruction right?

##### Share on other sites
I run in a little problem now
at the top you see the typedef with 256 which created a cross:map with an array
T[256][256]

so far so good everything works ok as long as i don t insert a nummer > 256
if i insert 512 the application instantly crashes without printing the std:cerr at all

so something must be screwed i use the VC++ toolkit 2003
any idea on what s going wrong?

this piece of code instanciates the map
typedef cross::map<float,256> map32;int main(int argc,char *argv[]){	std::cerr<<"short: "<<map32::MAP_SIZE<<std::endl;		map32 maps;	cross::fillmap<map32>(maps);	cross::smoothmap<map32>(maps);	cross::smoothmap<map32>(maps);	cross::smoothmap<map32>(maps);	maps.dump8(std::string("test.raw"));	return 0;}

2. class declaration and implementation of fillmap +dump8
//template quadratic map class	template<class T, uint32 uiSize> class map	{	public:		//typedefs and enums		//type specific definition of size to allow unrolling of loops		enum {MAP_SIZE = uiSize};		typedef	T	type_name;	public:		//member functions constructors & destructors		//fills the map with zero		map():m_uiSize(uiSize)		{			T tFill = 0;			std::fill<T*,T>(&m_Map[0][0],&m_Map[uiSize-1][uiSize-1],tFill);		};		//dumps a 8bit heightmap		void dump8(std::string strFilename)		{			uint32	i;			uint32	j;			std::ofstream file;			file.open(strFilename.c_str());			for(i=0;i<uiSize;i++)			{				for(j=0;j<uiSize;j++)				{					file << static_cast<uchar8>((m_Map[i][j] + 1.0f)*127);				}			}			file.close();		};	public:		//friend function declarations to access private members		template<class U> friend void fillmap(U &x);		template<class U> friend void smoothmap(U &x);	private:		uint32	m_uiSize;		T	m_Map[MAP_SIZE][uiSize];	};//template function for noise generation	template<class T> void	fillmap(T &x)	{		uint32 i;		uint32 j;				//loop through the rows		for(i=0;i<T::MAP_SIZE;i++)		{			//loop through the colums			for(j=0;j<T::MAP_SIZE;j++)			{				//set a new random seed everytime				srand(static_cast<int>(time(NULL)));				//retrieve a noise value with a new random pattern				x.m_Map[i][j] = noise(i,j,rand());			}		}	};

Just found the problem it caused a stack overflow i moved the instance of map32 maps out of the main function and it worked fine

i didn t know that the stack is so tiny
the map32 class is only 1 mb large

##### Share on other sites
Your problem is tons of redundant memory access and tons of unneccessary modulus ops. Are you really doing all those mods just to handle the boundary cases? You should handle those separately and write code for your inner loop that makes assumptions about the input (like no modulus is necessary).

I've had to deal with this problem in the past, and here was my solution. Since we're accessing 3 rows of the grid in a sequence, just do explicitly that except storing everything in local variables instead of re-accessing the array (no way will a compiler ever do this for you!).

So a first try would be to process a row as a strip. Imagine the current 9 points you're looking at are:
A D GB E HC F I

and you're computing E. On the next iteration you'll compute H. To compute H, you'll need to load in a new column of 3 points, but at that point you'll no longer need A,B or C, so you can just overwrite the new column into the positions of A B and C and process it by unrolling the loop 3 times and doing it like:

load GHI, process E with:A D GB E HC F Iload next column into ABC and process H with:  D G A  E H B  F I Cload next column into DEF and process B with:    G A D    H B E    I C F

You must unroll the loop 3 times with this approach because even though you're doing the same operation in each, the different variables have different semantic values in each (ABC means 1st, 3rd, or 2nd column depending on 1st,2nd, or 3rd iteration in the unroll), but notice that once GHI gets loaded it isn't changed for 3 iterations, that is where the advantage is.

Anyway, if you're looking to really optimize this, this is the sort of thing that will get you the big gains. Other approaches that still require redundant memory access will still be slow. Optimization is always the same old story sung with a different tune, and the theme is always memory access.

##### Share on other sites
ok i will implement what you descriped *its benchmark time *G**

thx

##### Share on other sites
Yup, good ajas95. I had thought of that, but wasn't going to mention everything I could think of straight away, for fear of premature optimisation. Making the code much longer is more likely to introduce bugs, and it's possibly not worth risking introducing bugs by optimising it much furthur. Many of those reads are going to be from the cache anyway. But ah what the heck!

Posting benchmarks would be nice.[smile]

The code I've posted takes care of all of the moduluses btw.

btw Basiror, just declare that large 2D array that was exceeding the stack size as static. There's no need to increase the scope of the variable by moving it out of main and making it global. Just be aware that doing either of these things, in general, makes it no longer thread-safe, if that is at all of concern to you.

##### Share on other sites
yes actually i should have allocated it on the heap
as soon as i have implemented the rest of the features i will implement the optimizations you all suggested (eleminating the moduluses and the reuse of calculated columns)