Optimizing math code in a templated struct (g++)

Started by deavik
8 comments, last by iMalc 18 years, 1 month ago
Hi, I am trying to write a good, solid vector math class using templates. I ran some tests today, and my version performs quite poorly in comparison to straight non-template code. To explain, this is a function in the template class (times are for 10000 runs, in microseconds):
	
	friend inline float length(const vector_base& a) // N = 3 in this case
	{
		float _result = 0;
		for (unsigned int j = 0; j < N; ++j) _result += a.components[j] * a.components[j];
		return std::sqrt(_result);
	}
-O flag -> 598, -O3 -> 827
/*
I know you will ask why it isn't a member function: it was 15 minutes ago, 
just changed it for some tests ... anyway performance is the same. 
The equivalent non-template function:
*/

float length_ord(float ax, float ay, float az)
{
	return std::sqrt(ax * ax + ay * ay + az * az);
}
-O flag -> 70, -O3 -> 26
Questions: 1. I know the addition in the template code is taking place in N steps not to mention the temporary object, is there any way I can improve that function? 2. Using -O2 or -O3 worsens performance for the template class function, is this normal? 3. Are there any other tips for optimizing the performance? Thanks!
Uh huh, you are trying to optimize code in a template. I suggest you learn the basics before trying to optimize.
Debug or release mode? Release-mode code often unrolls loops where it helps, adding to the speed-up. Secondly, how reliable are your performance tests?
To be fair, you would have to be passing a struct/class to length_ord, not 3 individual floats. e.g.
float length_ord(const vec3 &a)
Otherwise it might put stuff in registers, instead of read from RAM, which will of course be much faster, especially if your benchmark operates on the same 3 floats all the time.
You'll have to post your benchmark code, though I can pretty much guarantee that it's an invalid test anyway.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Back up and let's look at the smallest case.

Let's make a toy program that tests nothing but what we are working on:

template<int n>
struct FloatArray {
	float data[n];
};

template<int n>
float distance1( FloatArray<n> array )
{
	float sum = 0;
	for (int i = 0; i < n; ++i) {
		float squared = array.data[i]*array.data[i];
		sum = sum + squared;
	}
	float result = std::sqrt(sum);
	return result;
}

float distance2( FloatArray<3> array )
{
	enum {n = 3};
	float sum = 0;
	for (int i = 0; i < n; ++i) {
		float squared = array.data[i]*array.data[i];
		sum = sum + squared;
	}
	float result = std::sqrt(sum);
	return result;
}

float distance2_point5( FloatArray<3> array )
{
	float sum = 0;
	for (int i = 0; i < 3; ++i) {
		float squared = array.data[i]*array.data[i];
		sum = sum + squared;
	}
	float result = std::sqrt(sum);
	return result;
}

float distance3( FloatArray<3> array )
{
	float sum = 0;
	sum = sum + array.data[0]*array.data[0] + array.data[1]*array.data[1] + array.data[2]*array.data[2];
	float result = std::sqrt(sum);
	return result;
}

float distance4( float data1, float data2, float data3 )
{
	float sum = 0;
	sum = sum + data1*data1 + data2*data2 + data3*data3;
	float result = std::sqrt(sum);
	return result;
}

float distance5( float data1, float data2, float data3 )
{
	float result = std::sqrt(data1*data1 + data2*data2 + data3*data3);
	return result;
}

float distance6( float data1, float data2, float data3 )
{
	return std::sqrt(data1*data1 + data2*data2 + data3*data3);
}


Do you have access to the disassembly? Being able to look at what the optimizer does is important, even outside of benchmarking.

Knowing what kind of code your optimizer will understand can be useful. =)

Sometimes passing by value can be faster than passing by reference.

Other options that may be useful:
Quote:-ffast-math
Might allow some programs designed to not be too dependent on IEEE behavior for floating-point to run faster, or die trying. Sets `-funsafe-math-optimizations', and `-fno-trapping-math'.

-funsafe-math-optimizations
Allow optimizations that may give incorrect results for certain IEEE inputs.

-fno-trapping-math
Allow the compiler to assume that floating-point arithmetic will not generate traps on any inputs. This is useful, for example, when running a program using IEEE "non-stop" floating-point arithmetic.
You can try something like this:
template<int N>
struct FloatArray {
	float data[N];
};

template <int K>
struct Distance
{
	template <int N>
	static inline float dist2(const FloatArray<N>& array)
	{
		return Distance<K-1>::dist2(array) + array.data[K]*array.data[K];
	}
};

template <>
struct Distance<0>
{
	template <int N>
	static inline float dist2(const FloatArray<N>& array)
	{
		return array.data[0]*array.data[0];
	}
};

template<int N>
float dist( const FloatArray<N>& array )
{
	return std::sqrt(Distance<N-1>::dist2(array));
}


Turn on -O2; that should inline all the function calls. The generated assembler code should look similar to this:
00410480 <float dist<3>(FloatArray<3> const&)>:
  410480:       55                      push   %ebp
  410481:       89 e5                   mov    %esp,%ebp
  410483:       83 ec 08                sub    $0x8,%esp
  410486:       8b 55 08                mov    0x8(%ebp),%edx
  410489:       d9 02                   flds   (%edx)
  41048b:       d9 42 04                flds   0x4(%edx)
  41048e:       d9 c9                   fxch   %st(1)
  410490:       d8 c8                   fmul   %st(0),%st
  410492:       d9 c9                   fxch   %st(1)
  410494:       d8 c8                   fmul   %st(0),%st
  410496:       de c1                   faddp  %st,%st(1)
  410498:       d9 42 08                flds   0x8(%edx)
  41049b:       d8 c8                   fmul   %st(0),%st
  41049d:       de c1                   faddp  %st,%st(1)
  41049f:       d9 c0                   fld    %st(0)
  4104a1:       d9 fa                   fsqrt
  4104a3:       dd e0                   fucom  %st(0)
  4104a5:       df e0                   fnstsw %ax
  4104a7:       9e                      sahf
  4104a8:       7a 06                   jp     4104b0 <float dist<3>(FloatArray<3> const&)+0x30>
  4104aa:       75 04                   jne    4104b0 <float dist<3>(FloatArray<3> const&)+0x30>
  4104ac:       dd d9                   fstp   %st(1)
  4104ae:       c9                      leave
  4104af:       c3                      ret
  4104b0:       dd d8                   fstp   %st(0)
  4104b2:       d9 1c 24                fstps  (%esp)
  4104b5:       e8 06 de ff ff          call   40e2c0 <sqrtf>
  4104ba:       c9                      leave
  4104bb:       c3                      ret


OK, first of all, thank you everyone for replying. AP, I don't understand: is it not possible to optimize template code? I have read articles (well, I admit I read just the headlines) saying that templates are used to speed things up, like in the Blitz++ library. If you think it's not worth it, I'll back off.

iMalc, I admit this is the first time I have tried benchmarking something as precise as math functions. I am calling the high-performance timer from GLFW, and I go like this:
	current_time = glfwGetTime();
	for (int j = 0; j < 10000; ++j)
		res += call_length_func();
	time_taken = glfwGetTime() - current_time;


Anyhow, I took note of what you all said. I passed a struct of floats to length_ord, and got hold of the disassembly for both functions. I can more or less understand what it is doing for length_ord, but not for the template struct function (looks to me as if it is jumping into itself, for all my knowledge's worth):

Dump of assembler code for function _Z10length_ordRK11floatstruct:
0x08049bfc <_Z10length_ordRK11floatstruct+0>:   push   %ebp
0x08049bfd <_Z10length_ordRK11floatstruct+1>:   mov    %esp,%ebp
0x08049bff <_Z10length_ordRK11floatstruct+3>:   mov    0x8(%ebp),%eax
0x08049c02 <_Z10length_ordRK11floatstruct+6>:   flds   (%eax)
0x08049c04 <_Z10length_ordRK11floatstruct+8>:   flds   0x4(%eax)
0x08049c07 <_Z10length_ordRK11floatstruct+11>:  flds   0x8(%eax)
0x08049c0a <_Z10length_ordRK11floatstruct+14>:  fxch   %st(2)
0x08049c0c <_Z10length_ordRK11floatstruct+16>:  fmul   %st(0),%st
0x08049c0e <_Z10length_ordRK11floatstruct+18>:  fxch   %st(1)
0x08049c10 <_Z10length_ordRK11floatstruct+20>:  fmul   %st(0),%st
0x08049c12 <_Z10length_ordRK11floatstruct+22>:  faddp  %st,%st(1)
0x08049c14 <_Z10length_ordRK11floatstruct+24>:  fxch   %st(1)
0x08049c16 <_Z10length_ordRK11floatstruct+26>:  fmul   %st(0),%st
0x08049c18 <_Z10length_ordRK11floatstruct+28>:  faddp  %st,%st(1)
0x08049c1a <_Z10length_ordRK11floatstruct+30>:  fsqrt
0x08049c1c <_Z10length_ordRK11floatstruct+32>:  pop    %ebp
0x08049c1d <_Z10length_ordRK11floatstruct+33>:  ret

Dump of assembler code for function _ZN11vector_baseIfLj3EE6lengthEv:
0x08049d88 <_ZN11vector_baseIfLj3EE6lengthEv+0>:        push   %ebp
0x08049d89 <_ZN11vector_baseIfLj3EE6lengthEv+1>:        mov    %esp,%ebp
0x08049d8b <_ZN11vector_baseIfLj3EE6lengthEv+3>:        sub    $0x8,%esp
0x08049d8e <_ZN11vector_baseIfLj3EE6lengthEv+6>:        mov    0x8(%ebp),%eax
0x08049d91 <_ZN11vector_baseIfLj3EE6lengthEv+9>:        fldz
0x08049d93 <_ZN11vector_baseIfLj3EE6lengthEv+11>:       mov    $0x3,%edx
0x08049d98 <_ZN11vector_baseIfLj3EE6lengthEv+16>:       flds   (%eax)
0x08049d9a <_ZN11vector_baseIfLj3EE6lengthEv+18>:       fmul   %st(0),%st
0x08049d9c <_ZN11vector_baseIfLj3EE6lengthEv+20>:       faddp  %st,%st(1)
0x08049d9e <_ZN11vector_baseIfLj3EE6lengthEv+22>:       add    $0x4,%eax
0x08049da1 <_ZN11vector_baseIfLj3EE6lengthEv+25>:       dec    %edx
0x08049da2 <_ZN11vector_baseIfLj3EE6lengthEv+26>:       jne    0x8049d98 <_ZN11vector_baseIfLj3EE6lengthEv+16>
0x08049da4 <_ZN11vector_baseIfLj3EE6lengthEv+28>:       fld    %st(0)
0x08049da6 <_ZN11vector_baseIfLj3EE6lengthEv+30>:       fsqrt
0x08049da8 <_ZN11vector_baseIfLj3EE6lengthEv+32>:       fucom  %st(0)
0x08049daa <_ZN11vector_baseIfLj3EE6lengthEv+34>:       fnstsw %ax
0x08049dac <_ZN11vector_baseIfLj3EE6lengthEv+36>:       and    $0x45,%ah
0x08049daf <_ZN11vector_baseIfLj3EE6lengthEv+39>:       cmp    $0x40,%ah
0x08049db2 <_ZN11vector_baseIfLj3EE6lengthEv+42>:       je     0x8049dc6 <_ZN11vector_baseIfLj3EE6lengthEv+62>
0x08049db4 <_ZN11vector_baseIfLj3EE6lengthEv+44>:       fstp   %st(0)
0x08049db6 <_ZN11vector_baseIfLj3EE6lengthEv+46>:       sub    $0x10,%esp
0x08049db9 <_ZN11vector_baseIfLj3EE6lengthEv+49>:       fstps  (%esp)
0x08049dbc <_ZN11vector_baseIfLj3EE6lengthEv+52>:       call   0x8049898 <_init+920>
0x08049dc1 <_ZN11vector_baseIfLj3EE6lengthEv+57>:       add    $0x10,%esp
0x08049dc4 <_ZN11vector_baseIfLj3EE6lengthEv+60>:       jmp    0x8049dc8 <_ZN11vector_baseIfLj3EE6lengthEv+64>
0x08049dc6 <_ZN11vector_baseIfLj3EE6lengthEv+62>:       fstp   %st(1)
0x08049dc8 <_ZN11vector_baseIfLj3EE6lengthEv+64>:       leave
0x08049dc9 <_ZN11vector_baseIfLj3EE6lengthEv+65>:       ret


Look, I was just wondering if there was something terribly wrong in my templated function code. I am not trying to optimize too much, just as much as is possible as I go.
Although this link is a little out of date (made for VS6), I do recommend it for learning how to get the most out of optimising vector classes.
With newer versions of VS, the loop unrolling construct of that doc becomes unnecessary.
I'm not sure how it fares with GCC though.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Quote:Original post by iMalc
Although this link is a little out of date (made for VS6), I do recommend it for learning how to get the most out of optimising vector classes.
With newer versions of VS, the loop unrolling construct of that doc becomes unnecessary.
I'm not sure how it fares with GCC though.

iMalc, actually you were right, the way I was testing wasn't fair. I dug out an old vector class (without templates) and the new code performed at par (it actually worked faster for some cases like dot product). So I'm happy with it for now.

Thanks for the link!
Quote:Original post by deavik
Quote:Original post by iMalc
Although this link is a little out of date (made for VS6), I do recommend it for learning how to get the most out of optimising vector classes.
With newer versions of VS, the loop unrolling construct of that doc becomes unnecessary.
I'm not sure how it fares with GCC though.

iMalc, actually you were right, the way I was testing wasn't fair. I dug out an old vector class (without templates) and the new code performed at par (it actually worked faster for some cases like dot product). So I'm happy with it for now.
I thought it might. Good to hear!
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

