Sign in to follow this  
Basiror

optimize smooth code

Recommended Posts

How would you optimize this code? 1. replacing the '%' operations with simple if else conditions? currently i use the modulo to smooth the transitions 2.maybe one could unroll the loop completely? do compilers unroll loops with 32*32 iterations? or would this blow the code up too much?
//template function for smoothing
	template<class T> void smoothmap(T &x)
	{
		uint32 i = 0;
		uint32 j = 0;

		for(i=1;i<=x.m_uiSize;i++)
		{
			for(j=1;j<=x.m_uiSize;j++)
			{
				x.m_Map[i%x.m_uiSize][j%x.m_uiSize] = 
					//center
					x.m_Map[i][j]*0.25f 
					//diagonal corners
					+ x.m_Map[(i-1)%x.m_uiSize][(j-1)%x.m_uiSize]*0.0625f
					+ x.m_Map[(i-1)%x.m_uiSize][(j+1)%x.m_uiSize]*0.0625f
					+ x.m_Map[(i+1)%x.m_uiSize][(j-1)%x.m_uiSize]*0.0625f
					+ x.m_Map[(i+1)%x.m_uiSize][(j+1)%x.m_uiSize]*0.0625f
					//horizontal and vertical corners
					+ x.m_Map[(i-1)%x.m_uiSize][(j)%x.m_uiSize]*0.125f
					+ x.m_Map[(i+1)%x.m_uiSize][(j)%x.m_uiSize]*0.125f
					+ x.m_Map[(i)%x.m_uiSize][(j-1)%x.m_uiSize]*0.125f
					+ x.m_Map[(i)%x.m_uiSize][(j+1)%x.m_uiSize]*0.125f
					//end of instruction
					;
			}
		}
	}
2. is there a way to define count as a class specific constant? so i don t have to store it in a variable? that way i could pass it to the templated functions and give the compiler the possibility to unroll everything
template<class T, int count>
class someclass
{
private:
T m_Map[count][count];
}


Share this post


Link to post
Share on other sites
ah I think I found the solution

i just add a enum to the class definition like this one


//template quadratic map class
template<class T, uint32 uiSize> class map
{
public:
//typedefs and enums
enum {MAP_SIZE = uiSize};
typedef T type_name;
public:
//member functions constructors & destructors

//fills the map with zero
map():m_uiSize(uiSize)
{
T tFill = 0;
std::fill<T*,T>(&m_Map[0][0],&m_Map[uiSize-1][uiSize-1],tFill);
}

public:
//friend function declarations to access private members
template<class U> friend void fillmap(U &x);
template<class U> friend void smoothmap(U &x);
private:
uint32 m_uiSize;
T m_Map[uiSize][uiSize];
};



that should do the job

Share this post


Link to post
Share on other sites
It would probably be faster to do something like this: (not sure if the compiler is this smart?)


template<class T> void smoothmap(T &x)
{
uint32 i;
uint32 j;
uint32 ti;
uint32 tj;
uint32 t1, t2;
uint32 im,ip,jm,jp;

for(i=1;i<=x.m_uiSize;i++)
{
ti = i%x.m_uiSize;
im = (i-1)%x.m_uiSize;
ip =(i+1)%x.m_uiSize;
for(j=1;j<=x.m_uiSize;j++)
{
tj = j%x.m_uiSize;
jm=(j-1)%x.m_uiSize;
jp=(j+1)%x.m_uiSize;

x.m_Map[ti][tj]>>=2;
//center point

//diagonal corners
t1 = (x.m_Map[im][jm]
+ x.m_Map[im][jp]
+ x.m_Map[ip][jm]
+ x.m_Map[ip][jp])>>4;

//horizontal and vertical corners
t2 = (x.m_Map[im][tj]
+ x.m_Map[ip][tj]
+ x.m_Map[ti][jm]
+ x.m_Map[ti][jp])>>3;

x.m_Map[ti][tj]+= t1+t2;
}
}
}



This removes a bunch of modules operations, and removes a bunch of multiplication operations as well. Of course there are more optimizations, but this should be a good start. By shifting by 4, it is the same as dividing by 16 (which is what you were trying to do with your mult), and 3 is equivalant to dividing by 8 (as far as integer math is concerned!). If you're using floats, you must replace >>4 with *0.0625f and >>3 with *0.125f, and >>=2 with *=0.25f but that still saves 6 mults per iteration ;). Why do you use [i%x.m_uiSize] and another spot use [i]? You realize that if you're going from 1 -> uiSize, it will be reading/writing an invalid memory location right? My solution fixes this problem by using modulus for both of them, without actually requiring any extra operations (just extra variables). Each of the modulus for the i loop only has to be executed uiSize times rather than uiSize*uiSize, and it removes the # of times that you repeat the exact same operation ;).

Share this post


Link to post
Share on other sites
Replacing division of powers of 2 by shifts just makes the code harder to read. The compiler is indeed smart enough to apply decades-old point optimizations - assuming optimizations are turning on at all of course.

You can always look at the compiler assembly output to verify this sort of thing.

Share this post


Link to post
Share on other sites
Quote:
Original post by Anon Mike
Replacing division of powers of 2 by shifts just makes the code harder to read. The compiler is indeed smart enough to apply decades-old point optimizations - assuming optimizations are turning on at all of course.


Besides, shifts may not actually be faster. The Pentium 4, for example, doesn't do the shift in hardware, but instead uses an integer divide, iirc.

[Edited by - Promit on August 1, 2005 7:27:06 PM]

Share this post


Link to post
Share on other sites
Quote:
Original post by Promit
Quote:
Original post by Anon Mike
Replacing division of powers of 2 by shifts just makes the code harder to read. The compiler is indeed smart enough to apply decades-old point optimizations - assuming optimizations are turning on at all of course.


Besides, shifts may not actually be faster. The Pentium 4, for example, doesn't do the shift in hardware, but instead uses an integer divide, iirc.


erm, luckily I have only bought AMDs for arond 5 years now *G*

@ready4dis: i just missed the % in this case but I thnk I will modify it to what you suggested

Share this post


Link to post
Share on other sites
so i have applied the changes


//template function for smoothing
template<class T> void smoothmap(T &x)
{
uint32 i;
uint32 j;

uint32 ti1,ti2,ti3;
uint32 tj1,tj2,tj3;

for(i=1;i<=T::MAP_SIZE;i++)
{
//do modulos only once
ti1 = i%T::MAP_SIZE;
ti2 = (i+1)%T::MAP_SIZE;
ti3 = (i-1)%T::MAP_SIZE;

for(j=1;j<=T::MAP_SIZE;j++)
{
tj1 = j%T::MAP_SIZE;
tj2 = (j+1)%T::MAP_SIZE;
tj3 = (j-1)%T::MAP_SIZE;

x.m_Map[ti1][tj1] =
//center
x.m_Map[ti1][tj2]*0.25f
//diagonal corners
+ (x.m_Map[ti3][tj3]
+ x.m_Map[ti3][tj2]
+ x.m_Map[ti2][tj3]
+ x.m_Map[ti2][tj2])*0.0625f
//horizontal and vertical corners
+ (x.m_Map[ti3][tj1]
+ x.m_Map[ti2][tj1]
+ x.m_Map[ti1][tj3]
+ x.m_Map[ti1][tj2])*0.125f;
//end of instruction
;
}
}
}



as for the floating point multiplications, the m_Map values should be floats in general and i guess if i would use integers the compiler would optimize it to shifts on his own

Share this post


Link to post
Share on other sites
You don't want the compiler to unroll that, it will probably only worsen performance. There are more higher level optimisations left which would actually make a difference.

These moduluses are redundant, so you may as well remove them:
			ti3 = (i-1)%T::MAP_SIZE;
tj3 = (j-1)%T::MAP_SIZE;
Actually, given that it shouldn't matter in what order the smoothing function is applied to the grid elements, there's a cunning way of doing it without any modulus at all (or even an if!)
for (ti1=0, ti2=T::MAP_SIZE-1, ti3=1; ti2 >= 0; ti3=ti1, ti1=ti2, --ti2)
for (tj1=0, tj2=T::MAP_SIZE-1, tj3=1; tj2 >= 0; tj3=tj1, tj1=tj2, --tj2)
This even eliminates i and j too.[cool]

Anyway, by now I assume it's fast enough, if not, the memory access is probably the bottleneck at this stage.[grin]

Share this post


Link to post
Share on other sites
yeah what i could think of would be reorder the memory access to reduce memory jumping if that is still an issue at all with all those optimizations

i assume the compiler prefetches the array on its own so i don t have to add a prefetch instruction right?

Share this post


Link to post
Share on other sites
I run in a little problem now
at the top you see the typedef with 256 which created a cross:map with an array
T[256][256]

so far so good everything works ok as long as i don t insert a nummer > 256
if i insert 512 the application instantly crashes without printing the std:cerr at all

so something must be screwed i use the VC++ toolkit 2003
any idea on what s going wrong?

this piece of code instanciates the map

typedef cross::map<float,256> map32;
int main(int argc,char *argv[])
{
std::cerr<<"short: "<<map32::MAP_SIZE<<std::endl;

map32 maps;

cross::fillmap<map32>(maps);
cross::smoothmap<map32>(maps);
cross::smoothmap<map32>(maps);
cross::smoothmap<map32>(maps);

maps.dump8(std::string("test.raw"));


return 0;
}





2. class declaration and implementation of fillmap +dump8

//template quadratic map class
template<class T, uint32 uiSize> class map
{
public:
//typedefs and enums

//type specific definition of size to allow unrolling of loops
enum {MAP_SIZE = uiSize};
typedef T type_name;
public:
//member functions constructors & destructors

//fills the map with zero
map():m_uiSize(uiSize)
{
T tFill = 0;
std::fill<T*,T>(&m_Map[0][0],&m_Map[uiSize-1][uiSize-1],tFill);
};

//dumps a 8bit heightmap
void dump8(std::string strFilename)
{
uint32 i;
uint32 j;

std::ofstream file;
file.open(strFilename.c_str());

for(i=0;i<uiSize;i++)
{
for(j=0;j<uiSize;j++)
{
file << static_cast<uchar8>((m_Map[i][j] + 1.0f)*127);
}
}

file.close();
};

public:
//friend function declarations to access private members
template<class U> friend void fillmap(U &x);
template<class U> friend void smoothmap(U &x);
private:
uint32 m_uiSize;
T m_Map[MAP_SIZE][uiSize];
};

//template function for noise generation
template<class T> void fillmap(T &x)
{
uint32 i;
uint32 j;

//loop through the rows
for(i=0;i<T::MAP_SIZE;i++)
{
//loop through the colums
for(j=0;j<T::MAP_SIZE;j++)
{
//set a new random seed everytime
srand(static_cast<int>(time(NULL)));

//retrieve a noise value with a new random pattern
x.m_Map[i][j] = noise(i,j,rand());
}
}
};






Just found the problem it caused a stack overflow i moved the instance of map32 maps out of the main function and it worked fine

i didn t know that the stack is so tiny
the map32 class is only 1 mb large

Share this post


Link to post
Share on other sites
Your problem is tons of redundant memory access and tons of unneccessary modulus ops. Are you really doing all those mods just to handle the boundary cases? You should handle those separately and write code for your inner loop that makes assumptions about the input (like no modulus is necessary).

I've had to deal with this problem in the past, and here was my solution. Since we're accessing 3 rows of the grid in a sequence, just do explicitly that except storing everything in local variables instead of re-accessing the array (no way will a compiler ever do this for you!).

So a first try would be to process a row as a strip. Imagine the current 9 points you're looking at are:

A D G
B E H
C F I

and you're computing E. On the next iteration you'll compute H. To compute H, you'll need to load in a new column of 3 points, but at that point you'll no longer need A,B or C, so you can just overwrite the new column into the positions of A B and C and process it by unrolling the loop 3 times and doing it like:

assume ABC, DEF already loaded...

load GHI, process E with:
A D G
B E H
C F I
load next column into ABC and process H with:
D G A
E H B
F I C
load next column into DEF and process B with:
G A D
H B E
I C F

You must unroll the loop 3 times with this approach because even though you're doing the same operation in each, the different variables have different semantic values in each (ABC means 1st, 3rd, or 2nd column depending on 1st,2nd, or 3rd iteration in the unroll), but notice that once GHI gets loaded it isn't changed for 3 iterations, that is where the advantage is.

Anyway, if you're looking to really optimize this, this is the sort of thing that will get you the big gains. Other approaches that still require redundant memory access will still be slow. Optimization is always the same old story sung with a different tune, and the theme is always memory access.

Share this post


Link to post
Share on other sites
Yup, good ajas95. I had thought of that, but wasn't going to mention everything I could think of straight away, for fear of premature optimisation. Making the code much longer is more likely to introduce bugs, and it's possibly not worth risking introducing bugs by optimising it much furthur. Many of those reads are going to be from the cache anyway. But ah what the heck!

Posting benchmarks would be nice.[smile]

The code I've posted takes care of all of the moduluses btw.

btw Basiror, just declare that large 2D array that was exceeding the stack size as static. There's no need to increase the scope of the variable by moving it out of main and making it global. Just be aware that doing either of these things, in general, makes it no longer thread-safe, if that is at all of concern to you.

Share this post


Link to post
Share on other sites
yes actually i should have allocated it on the heap
as soon as i have implemented the rest of the features i will implement the optimizations you all suggested (eleminating the moduluses and the reuse of calculated columns)

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this