Optimizing math code in a templated struct (g++)

This topic is 4300 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


Hi, I am trying to write a good, solid vector math class using templates. I ran some tests today, and my version performs quite poorly compared to straight non-template code. To explain, this is a function in the template class (times are for 10000 runs, in microseconds):
	
	friend inline float length(const vector_base& a) // N = 3 in this case
	{
		float _result = 0;
		for (unsigned int j = 0; j < N; ++j) _result += a.components[j] * a.components[j];
		return std::sqrt(_result);
	}
-O flag -> 598, -O3 -> 827
/*
I know you will ask why it isn't a member function: it was 15 minutes ago, 
just changed it for some tests ... anyway performance is the same. 
The equivalent non-template function:
*/

float length_ord(float ax, float ay, float az)
{
	return std::sqrt(ax * ax + ay * ay + az * az);
}
-O flag -> 70, -O3 -> 26
Questions:
1. I know the addition in the template code takes place in N steps, not to mention the temporary object; is there any way I can improve that function?
2. Using -O2 or -O3 worsens performance for the template class function; is this normal?
3. Are there any other tips for improving performance?
Thanks!
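One way to attack question 1 is to give the compiler the unrolled sum directly for the sizes you care about, via an explicit specialization. This is only a minimal sketch under assumed names (the vec struct and length here are stand-ins, not the actual class from the post):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal stand-in for the templated vector class.
template <unsigned N>
struct vec {
    float components[N];
};

// Generic version: runtime loop over N elements.
template <unsigned N>
inline float length(const vec<N>& a)
{
    float result = 0.0f;
    for (unsigned j = 0; j < N; ++j)
        result += a.components[j] * a.components[j];
    return std::sqrt(result);
}

// Specialization for N = 3: the sum is written out, so there is no
// loop counter or branch left for the optimizer to deal with.
template <>
inline float length<3>(const vec<3>& a)
{
    return std::sqrt(a.components[0] * a.components[0]
                   + a.components[1] * a.components[1]
                   + a.components[2] * a.components[2]);
}
```

In practice g++ usually unrolls a constant-trip-count loop like this on its own at -O2, so whether the specialization buys anything is something to verify in the disassembly rather than assume.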

Guest Anonymous Poster
Uh huh, you are trying to optimize code in a template. I suggest you learn the basics before trying to optimize.

To be fair, you would have to be passing a struct/class to length_ord, not 3 individual floats, e.g.
float length_ord(const vec3 &a)
Otherwise it might keep things in registers instead of reading from RAM, which will of course be much faster, especially if your benchmark operates on the same 3 floats all the time.
You'll have to post your benchmark code, though I can pretty much guarantee that it's an invalid test anyway.
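To make that concern concrete: if the loop feeds the same constant inputs to the function every iteration, the optimizer can hoist most of the work out of the loop. Below is a sketch of a slightly more robust harness, using std::chrono instead of whatever timer the original benchmark used, and varying the input each iteration so the calls cannot be folded away. All names here are illustrative, not from the thread:

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

struct vec3 { float x, y, z; };

inline float length_ord(const vec3& a)
{
    return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z);
}

// Runs the timing loop and returns the accumulated result, so the
// calls are not dead code the compiler can delete outright.
inline float run_benchmark(long long& microseconds)
{
    vec3 v = {1.0f, 2.0f, 3.0f};
    float sink = 0.0f;

    auto start = std::chrono::steady_clock::now();
    for (int j = 0; j < 10000; ++j) {
        v.x += 0.001f;              // vary the input each iteration
        sink += length_ord(v);      // accumulate so the result is used
    }
    auto stop = std::chrono::steady_clock::now();

    microseconds = std::chrono::duration_cast<std::chrono::microseconds>(
                       stop - start).count();
    return sink;
}
```

Printing (or otherwise using) the returned sink value matters: a result nothing reads is exactly what gives an inlined function an apparent time of zero.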

Back up and let's look at the smallest case.

Let's make a toy program that tests nothing but what we are working on:

#include <cmath>

template<int n>
struct FloatArray {
    float data[n];
};

template<int n>
float distance1( FloatArray<n> array )
{
    float sum = 0;
    for (int i = 0; i < n; ++i) {
        float squared = array.data[i]*array.data[i];
        sum = sum + squared;
    }
    float result = std::sqrt(sum);
    return result;
}

float distance2( FloatArray<3> array )
{
    enum {n = 3};
    float sum = 0;
    for (int i = 0; i < n; ++i) {
        float squared = array.data[i]*array.data[i];
        sum = sum + squared;
    }
    float result = std::sqrt(sum);
    return result;
}

float distance2_point5( FloatArray<3> array )
{
    float sum = 0;
    for (int i = 0; i < 3; ++i) {
        float squared = array.data[i]*array.data[i];
        sum = sum + squared;
    }
    float result = std::sqrt(sum);
    return result;
}

float distance3( FloatArray<3> array )
{
    float sum = 0;
    sum = sum + array.data[0]*array.data[0] + array.data[1]*array.data[1] + array.data[2]*array.data[2];
    float result = std::sqrt(sum);
    return result;
}

float distance4( float data1, float data2, float data3 )
{
    float sum = 0;
    sum = sum + data1*data1 + data2*data2 + data3*data3;
    float result = std::sqrt(sum);
    return result;
}

float distance5( float data1, float data2, float data3 )
{
    float result = std::sqrt(data1*data1 + data2*data2 + data3*data3);
    return result;
}

float distance6( float data1, float data2, float data3 )
{
    return std::sqrt(data1*data1 + data2*data2 + data3*data3);
}


Do you have access to the disassembly? Being able to look at what the optimizer does is important, even outside of benchmarking.

Knowing what kind of code your optimizer will understand can be useful. =)

Sometimes passing by value can be faster than passing by reference.
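To illustrate that last point, the two calling conventions for a small POD look like this (both signatures are hypothetical examples; which one wins depends on the ABI and on whether the call gets inlined at all):

```cpp
#include <cassert>
#include <cmath>

struct vec3 { float x, y, z; };

// By value: on some ABIs the members of a tiny struct can travel in
// registers, with no memory access in the callee.
inline float length_by_value(vec3 a)
{
    return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z);
}

// By const reference: the callee reads the members through a pointer,
// which can be slower for tiny structs if the call is not inlined.
inline float length_by_ref(const vec3& a)
{
    return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z);
}
```

Both compute the same value, so the only way to know which form your compiler handles better is, again, to look at the generated code.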

Other options that may be useful:
Quote:
-ffast-math
Might allow some programs designed to not be too dependent on IEEE behavior for floating-point to run faster, or die trying. Sets `-funsafe-math-optimizations', and `-fno-trapping-math'.

-funsafe-math-optimizations
Allow optimizations that may give incorrect results for certain IEEE inputs.

-fno-trapping-math
Allow the compiler to assume that floating-point arithmetic will not generate traps on any inputs. This is useful, for example, when running a program using IEEE "non-stop" floating-point arithmetic.

You can try something like this:

template<int N>
struct FloatArray {
    float data[N];
};

template <int K>
struct Distance
{
    template <int N>
    static inline float dist2(const FloatArray<N>& array)
    {
        return Distance<K-1>::dist2(array) + array.data[K]*array.data[K];
    }
};

template <>
struct Distance<0>
{
    template <int N>
    static inline float dist2(const FloatArray<N>& array)
    {
        return array.data[0]*array.data[0];
    }
};

template<int N>
float dist( const FloatArray<N>& array )
{
    return std::sqrt(Distance<N-1>::dist2(array));
}


Turn on -O2. This will inline all the function calls. The generated assembly should look similar to this:

00410480 <float dist<3>(FloatArray<3> const&)>:
410480: 55 push %ebp
410481: 89 e5 mov %esp,%ebp
410483: 83 ec 08 sub $0x8,%esp
410486: 8b 55 08 mov 0x8(%ebp),%edx
410489: d9 02 flds (%edx)
41048b: d9 42 04 flds 0x4(%edx)
41048e: d9 c9 fxch %st(1)
410490: d8 c8 fmul %st(0),%st
410492: d9 c9 fxch %st(1)
410494: d8 c8 fmul %st(0),%st
410496: de c1 faddp %st,%st(1)
410498: d9 42 08 flds 0x8(%edx)
41049b: d8 c8 fmul %st(0),%st
41049d: de c1 faddp %st,%st(1)
41049f: d9 c0 fld %st(0)
4104a1: d9 fa fsqrt
4104a3: dd e0 fucom %st(0)
4104a5: df e0 fnstsw %ax
4104a7: 9e sahf
4104a8: 7a 06 jp 4104b0 <float dist<3>(FloatArray<3> const&)+0x30>
4104aa: 75 04 jne 4104b0 <float dist<3>(FloatArray<3> const&)+0x30>
4104ac: dd d9 fstp %st(1)
4104ae: c9 leave
4104af: c3 ret
4104b0: dd d8 fstp %st(0)
4104b2: d9 1c 24 fstps (%esp)
4104b5: e8 06 de ff ff call 40e2c0 <sqrtf>
4104ba: c9 leave
4104bb: c3 ret

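For completeness, the recursive expansion above can be exercised like this (repeating the definitions so the snippet is self-contained; the test values are illustrative):

```cpp
#include <cassert>
#include <cmath>

template<int N>
struct FloatArray {
    float data[N];
};

// Recursive case: sum of squares of elements 0..K.
template <int K>
struct Distance {
    template <int N>
    static inline float dist2(const FloatArray<N>& array)
    {
        return Distance<K-1>::dist2(array) + array.data[K] * array.data[K];
    }
};

// Base case terminates the recursion at element 0.
template <>
struct Distance<0> {
    template <int N>
    static inline float dist2(const FloatArray<N>& array)
    {
        return array.data[0] * array.data[0];
    }
};

template<int N>
float dist(const FloatArray<N>& array)
{
    return std::sqrt(Distance<N-1>::dist2(array));
}
```

The recursion is resolved entirely at compile time, so with -O2 the whole chain typically collapses into the straight-line multiply/add sequence shown in the disassembly above, for any fixed N.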

OK, first of all, thank you everyone for replying. AP, I don't understand: is it not possible to optimize template code? I have read articles (well, I admit I read just the headlines) saying that templates are used to speed things up, as in the blitz++ library. If you think it's not worth it, I'll back off.

iMalc, I admit this is the first time I have tried benchmarking something as precise as math functions. I am calling the high-performance timer from glfw, and I go like this:
	current_time = glfwGetTime();

	for (int j = 0; j < 10000; ++j)
		res += call_length_func();

	time_taken = glfwGetTime() - current_time;


Anyhow, I took note of what you all said. I passed a struct of floats to length_ord, and got hold of the disassembly for both functions. I can more or less understand what it is doing for length_ord, but not for the template struct function (it looks to me like it is jumping into itself, for all my knowledge's worth):

Dump of assembler code for function _Z10length_ordRK11floatstruct:
0x08049bfc <_Z10length_ordRK11floatstruct+0>: push %ebp
0x08049bfd <_Z10length_ordRK11floatstruct+1>: mov %esp,%ebp
0x08049bff <_Z10length_ordRK11floatstruct+3>: mov 0x8(%ebp),%eax
0x08049c02 <_Z10length_ordRK11floatstruct+6>: flds (%eax)
0x08049c04 <_Z10length_ordRK11floatstruct+8>: flds 0x4(%eax)
0x08049c07 <_Z10length_ordRK11floatstruct+11>: flds 0x8(%eax)
0x08049c0a <_Z10length_ordRK11floatstruct+14>: fxch %st(2)
0x08049c0c <_Z10length_ordRK11floatstruct+16>: fmul %st(0),%st
0x08049c0e <_Z10length_ordRK11floatstruct+18>: fxch %st(1)
0x08049c10 <_Z10length_ordRK11floatstruct+20>: fmul %st(0),%st
0x08049c12 <_Z10length_ordRK11floatstruct+22>: faddp %st,%st(1)
0x08049c14 <_Z10length_ordRK11floatstruct+24>: fxch %st(1)
0x08049c16 <_Z10length_ordRK11floatstruct+26>: fmul %st(0),%st
0x08049c18 <_Z10length_ordRK11floatstruct+28>: faddp %st,%st(1)
0x08049c1a <_Z10length_ordRK11floatstruct+30>: fsqrt
0x08049c1c <_Z10length_ordRK11floatstruct+32>: pop %ebp
0x08049c1d <_Z10length_ordRK11floatstruct+33>: ret


Dump of assembler code for function _ZN11vector_baseIfLj3EE6lengthEv:
0x08049d88 <_ZN11vector_baseIfLj3EE6lengthEv+0>: push %ebp
0x08049d89 <_ZN11vector_baseIfLj3EE6lengthEv+1>: mov %esp,%ebp
0x08049d8b <_ZN11vector_baseIfLj3EE6lengthEv+3>: sub $0x8,%esp
0x08049d8e <_ZN11vector_baseIfLj3EE6lengthEv+6>: mov 0x8(%ebp),%eax
0x08049d91 <_ZN11vector_baseIfLj3EE6lengthEv+9>: fldz
0x08049d93 <_ZN11vector_baseIfLj3EE6lengthEv+11>: mov $0x3,%edx
0x08049d98 <_ZN11vector_baseIfLj3EE6lengthEv+16>: flds (%eax)
0x08049d9a <_ZN11vector_baseIfLj3EE6lengthEv+18>: fmul %st(0),%st
0x08049d9c <_ZN11vector_baseIfLj3EE6lengthEv+20>: faddp %st,%st(1)
0x08049d9e <_ZN11vector_baseIfLj3EE6lengthEv+22>: add $0x4,%eax
0x08049da1 <_ZN11vector_baseIfLj3EE6lengthEv+25>: dec %edx
0x08049da2 <_ZN11vector_baseIfLj3EE6lengthEv+26>: jne 0x8049d98 <_ZN11vector_baseIfLj3EE6lengthEv+16>
0x08049da4 <_ZN11vector_baseIfLj3EE6lengthEv+28>: fld %st(0)
0x08049da6 <_ZN11vector_baseIfLj3EE6lengthEv+30>: fsqrt
0x08049da8 <_ZN11vector_baseIfLj3EE6lengthEv+32>: fucom %st(0)
0x08049daa <_ZN11vector_baseIfLj3EE6lengthEv+34>: fnstsw %ax
0x08049dac <_ZN11vector_baseIfLj3EE6lengthEv+36>: and $0x45,%ah
0x08049daf <_ZN11vector_baseIfLj3EE6lengthEv+39>: cmp $0x40,%ah
0x08049db2 <_ZN11vector_baseIfLj3EE6lengthEv+42>: je 0x8049dc6 <_ZN11vector_baseIfLj3EE6lengthEv+62>
0x08049db4 <_ZN11vector_baseIfLj3EE6lengthEv+44>: fstp %st(0)
0x08049db6 <_ZN11vector_baseIfLj3EE6lengthEv+46>: sub $0x10,%esp
0x08049db9 <_ZN11vector_baseIfLj3EE6lengthEv+49>: fstps (%esp)
0x08049dbc <_ZN11vector_baseIfLj3EE6lengthEv+52>: call 0x8049898 <_init+920>
0x08049dc1 <_ZN11vector_baseIfLj3EE6lengthEv+57>: add $0x10,%esp
0x08049dc4 <_ZN11vector_baseIfLj3EE6lengthEv+60>: jmp 0x8049dc8 <_ZN11vector_baseIfLj3EE6lengthEv+64>
0x08049dc6 <_ZN11vector_baseIfLj3EE6lengthEv+62>: fstp %st(1)
0x08049dc8 <_ZN11vector_baseIfLj3EE6lengthEv+64>: leave
0x08049dc9 <_ZN11vector_baseIfLj3EE6lengthEv+65>: ret



Look, I was just wondering if there was something terribly wrong in my templated function code. I am not trying to optimize too much, just as much as is possible on the go.

Although this link is a little out of date (it was made for VS6), I do recommend it for learning how to get the most out of optimising vector classes.
With newer versions of VS, the loop-unrolling construct in that doc becomes unnecessary.
I'm not sure how it fares with GCC, though.

Quote:
Original post by iMalc
Although this link is a little out of date (made for VS6), I do recommend it for learning how to get the most out of optimising vector classes.
With newer versions of VS, the loop unrolling construct of that doc becomes unnecessary.
I'm not sure how it fares with GCC though.

iMalc, actually you were right; the way I was testing wasn't fair. I dug out an old vector class (without templates) and the new code performed on par (it actually worked faster for some cases, like dot product). So I'm happy with it for now.

Thanks for the link!

Quote:
Original post by deavik
Quote:
Original post by iMalc
Although this link is a little out of date (made for VS6), I do recommend it for learning how to get the most out of optimising vector classes.
With newer versions of VS, the loop unrolling construct of that doc becomes unnecessary.
I'm not sure how it fares with GCC though.

iMalc, actually you were right; the way I was testing wasn't fair. I dug out an old vector class (without templates) and the new code performed on par (it actually worked faster for some cases, like dot product). So I'm happy with it for now.
I thought it might. Good to hear!
