Sign in to follow this  

Where is the bottleneck in this code?

This topic is 4391 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Hi, i´m currently rewriting my math libraries and for convenient, i want to use one single template class to incorporate vectors and matrices. here is my simple matrix class. I have done a simple speed test on my Athlon XP 1800+ with Windows XP and Microsoft Visual C++ .NET in Release configuration. In a for loop, which runs 10000000 times, i multiply a 3d vector with a 3x3 matrix. I get an average benchmark of 280 ms, which is terrifying slow... :( If i do the same operation with my old 3D vector classes (which are many different, isolated classes without any template code and all the neat stuff), then it takes 0 milliseconds - in other words - so fast, that the time it takes is too little that it can be measured in milliseconds. I think i know the bottleneck... The for loops in the CMatrix::mul method are not unrolled... i have disassembled it. But even with all optimizations in the compiler activated, i didn´t get better benchmarks anyway. Can someone give me some hints for rearranging my code design to make it faster? Thanks in advance, Gammastrahler
template <typename T, int M, int N>
class CMatrix
{
	public:
	T	_v[M][N];

	CMatrix() {}

	inline void load_zero()
	{
	    memset(_v, 0, sizeof(_v));
	}

	template <int P>
	inline CMatrix<T, M, P> mul(const CMatrix<T, N, P> &m)
	{
		CMatrix temp;

		temp.load_zero();

		for (int i = 0; i < M; i++)
			for (int j = 0; j < P; j++)
				for (int k = 0; k < N; k++)
				{
					temp.v(i, j) += _v[i][k] * m.v(k, j);
				}
		return temp;
	}

	inline T &v(int i, int j) { return _v[i][j]; }
	inline const T &v(int i, int j) const { return _v[i][j]; }

	inline T &v(int j) { return _v[0][j]; }
	inline const T &v(int j) const { return _v[0][j]; }
};


typedef CMatrix<float, 4, 4> CMatrix4x4f;
typedef CMatrix<float, 3, 3> CMatrix3x3f;
typedef CMatrix<float, 2, 2> CMatrix2x2f;
typedef CMatrix<double, 4, 4> CMatrix4x4d;
typedef CMatrix<double, 3, 3> CMatrix3x3d;
typedef CMatrix<double, 2, 2> CMatrix2x2d;

typedef CMatrix<float, 1, 4> CVector4f;
typedef CMatrix<float, 1, 3> CVector3f;
typedef CMatrix<float, 1, 2> CVector2f;
typedef CMatrix<double, 1, 4> CVector4d;
typedef CMatrix<double, 1, 3> CVector3d;
typedef CMatrix<double, 1, 2> CVector2d;


Share this post


Link to post
Share on other sites
Quote:
Original post by Gammastrahler
If i do the same operation with my old 3D vector classes (which are many different, isolated classes without any template code and all the neat stuff), then it takes 0 milliseconds - in other words - so fast, that the time it takes is too little that it can be measured in milliseconds.

Er, are you sure it actually takes 0ms? You have to be careful with microbenchmarks that you do something with the result so the compiler doesn't optimise the whole thing out into a no-op. Up your benchmark iterations by a few hundred/thousand/million/whatever and see if you can actually get a non-zero time elapsed...

Share this post


Link to post
Share on other sites
We can't guess where the bottlenecks are. Use a profiler like AMD's CodeAnalyst or Intel's vTune (for MSVC users) or gprof for gcc users (on *nix and Windows). That will give you more information than you will ever need.

The hard part is interpreting the results correctly, and knowing where to target your energies.

Share this post


Link to post
Share on other sites
I can't spot a major bottleneck apart from a few alignment issues, but you really shouldn't do what you are trying to do.
First of all, a matrix is semantically different from vectors. Secondly, if you really want the best performance, you should consider using a three-register scheme instead.

You could get rid of the memset, btw since it's completely useless:

template <int P>
inline CMatrix<T, M, P> mul(const CMatrix<T, N, P> &m)
{
CMatrix temp;

//temp.load_zero();

for (int i = 0; i < M; i++)
for (int j = 0; j < P; j++) {
temp.v(i, j) = _v[i][0] * m.v(0, j);
for (int k = 1; k < N; k++)
{
temp.v(i, j) += _v[i][k] * m.v(k, j);
}
}
return temp;
}


You can also use the compiler for unrolling the loops.

Share this post


Link to post
Share on other sites
Quote:
Original post by Gammastrahler
Hmm, have increased the loop cycles to the maximum long value possible...
it still takes 0 MS...

but with the template code, it takes longer as the cycles increase...

Well your non-templated benchmark is broken then. Fix it first and you'll at least have something to compare your current method with.

Share this post


Link to post
Share on other sites

This topic is 4391 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this