Optimizing math code in a templated struct (g++)

Hi, I am trying to write a good, solid vector math class using templates. I ran some tests today, and my version is performing quite poorly in comparison to straight non-template code. To explain, this is a function in the template class (times included are for 10000 runs, in microseconds):

friend inline float length(const vector_base& a) // N = 3 in this case
{
    float _result = 0;
    for (unsigned int j = 0; j < N; ++j) _result += a.components[j] * a.components[j];
    return std::sqrt(_result);
}

-O flag -> 598, -O3 -> 827

/*
I know you will ask why it isn't a member function: it was 15 minutes ago,
I just changed it for some tests ... anyway, performance is the same.
The equivalent non-template function:
*/
float length_ord(float ax, float ay, float az)
{
    return std::sqrt(ax * ax + ay * ay + az * az);
}

-O flag -> 70, -O3 -> 26

Questions:
1. I know the addition in the template code is taking place in N steps, not to mention the temporary object; is there any way I can improve that function?
2. Using -O2 or -O3 worsens performance for the template class function; is this normal?
3. Are there any other tips for optimizing the performance?
Thanks!
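For anyone who wants to reproduce the comparison, here is a minimal self-contained sketch of both versions. The `vector_base` layout is an assumption inferred from the snippets above (the real class surely has more members and operations); the friend function is found through argument-dependent lookup.

```cpp
#include <cmath>

// Assumed layout: a raw array of N components, matching the
// a.components[j] accesses in the original snippet.
template<typename T, unsigned int N>
struct vector_base
{
    T components[N];

    // Friend function as in the original post: sums the squares of the
    // components in a loop, then takes the square root.
    friend inline float length(const vector_base& a)
    {
        float result = 0;
        for (unsigned int j = 0; j < N; ++j)
            result += a.components[j] * a.components[j];
        return std::sqrt(result);
    }
};

// The non-template equivalent being benchmarked against.
float length_ord(float ax, float ay, float az)
{
    return std::sqrt(ax * ax + ay * ay + az * az);
}
```

Both functions compute the same value; the question in the thread is purely about what the optimizer does with the fixed-trip-count loop.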
Uh huh, you are trying to optimize code in a template. I suggest you learn the basics before trying to optimize.
Debug or release mode? Release-mode code often unrolls loops if needed, adding to the speed-up. Secondly, how reliable are your performance tests?
To be fair, you would have to be passing a struct/class to length_ord, not 3 individual floats, e.g.

float length_ord(const vec3 &a)

Otherwise it might put stuff in registers instead of reading from RAM, which will of course be much faster, especially if your benchmark operates on the same 3 floats all the time. You'll have to post your benchmark code, though I can pretty much guarantee that it's an invalid test anyway.
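A complete version of the struct-taking signature suggested here might look like the following; the `vec3` member names are an assumption, since only the signature appears in the post:

```cpp
#include <cmath>

// Hypothetical aggregate matching the suggested signature: passing the
// same kind of object to both functions makes the benchmark load from
// memory in both cases, instead of keeping one side entirely in registers.
struct vec3 { float x, y, z; };

float length_ord(const vec3& a)
{
    return std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z);
}
```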
Back up and let's look at the smallest case.
Let's make a toy program that tests nothing but what we are working on:
#include <cmath>

template<int n>
struct FloatArray
{
    float data[n];
};

template<int n>
float distance1( FloatArray<n> array )
{
    float sum = 0;
    for (int i = 0; i < n; ++i)
    {
        float squared = array.data[i]*array.data[i];
        sum = sum + squared;
    }
    float result = std::sqrt(sum);
    return result;
}

float distance2( FloatArray<3> array )
{
    enum {n = 3};
    float sum = 0;
    for (int i = 0; i < n; ++i)
    {
        float squared = array.data[i]*array.data[i];
        sum = sum + squared;
    }
    float result = std::sqrt(sum);
    return result;
}

float distance2_point5( FloatArray<3> array )
{
    float sum = 0;
    for (int i = 0; i < 3; ++i)
    {
        float squared = array.data[i]*array.data[i];
        sum = sum + squared;
    }
    float result = std::sqrt(sum);
    return result;
}

float distance3( FloatArray<3> array )
{
    float sum = 0;
    sum = sum + array.data[0]*array.data[0]
              + array.data[1]*array.data[1]
              + array.data[2]*array.data[2];
    float result = std::sqrt(sum);
    return result;
}

float distance4( float data1, float data2, float data3 )
{
    float sum = 0;
    sum = sum + data1*data1 + data2*data2 + data3*data3;
    float result = std::sqrt(sum);
    return result;
}

float distance5( float data1, float data2, float data3 )
{
    float result = std::sqrt(data1*data1 + data2*data2 + data3*data3);
    return result;
}

float distance6( float data1, float data2, float data3 )
{
    return std::sqrt(data1*data1 + data2*data2 + data3*data3);
}
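To sanity-check the toy program, the two extreme variants can be compared directly; this sketch reproduces only `distance1` (the loop version) and `distance6` (the straight-line version) so it compiles on its own:

```cpp
#include <cmath>

template<int n>
struct FloatArray { float data[n]; };

// Loop version (distance1 above): the question is whether the optimizer
// unrolls this fixed-trip-count loop down to the straight-line version.
template<int n>
float distance1(FloatArray<n> array)
{
    float sum = 0;
    for (int i = 0; i < n; ++i)
        sum += array.data[i] * array.data[i];
    return std::sqrt(sum);
}

// Straight-line version (distance6 above): no loop, no array.
float distance6(float d1, float d2, float d3)
{
    return std::sqrt(d1 * d1 + d2 * d2 + d3 * d3);
}
```

Compiling with `g++ -O2 -S` and comparing the generated assembly for each variant shows directly how far the optimizer gets with each formulation.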
Do you have access to the dissassembly? Being able to look at what the optimizer does is important, even outside of benchmarking.
Knowing what kind of code your optimizer will understand can be useful. =)
Sometimes passing by value can be faster than passing by reference.
Other options that may be useful:
Quote:-ffast-math
Might allow some programs designed to not be too dependent on IEEE behavior for floating-point to run faster, or die trying. Sets `-funsafe-math-optimizations', and `-fno-trapping-math'.
-funsafe-math-optimizations
Allow optimizations that may give incorrect results for certain IEEE inputs.
-fno-trapping-math
Allow the compiler to assume that floating-point arithmetic will not generate traps on any inputs. This is useful, for example, when running a program using IEEE "non-stop" floating-point arithmetic.
You can try something like this:
template<int N>
struct FloatArray
{
    float data[N];
};

template <int K>
struct Distance
{
    template <int N>
    static inline float dist2(const FloatArray<N>& array)
    {
        return Distance<K-1>::dist2(array) + array.data[K]*array.data[K];
    }
};

template <>
struct Distance<0>
{
    template <int N>
    static inline float dist2(const FloatArray<N>& array)
    {
        return array.data[0]*array.data[0];
    }
};

template<int N>
float dist( const FloatArray<N>& array )
{
    return std::sqrt(Distance<N-1>::dist2(array));
}
Turn on -O2. This will inline all function calls. The generated assembler code should look similar to this one:
00410480 <float dist<3>(FloatArray<3> const&)>:
  410480: 55             push   %ebp
  410481: 89 e5          mov    %esp,%ebp
  410483: 83 ec 08       sub    $0x8,%esp
  410486: 8b 55 08       mov    0x8(%ebp),%edx
  410489: d9 02          flds   (%edx)
  41048b: d9 42 04       flds   0x4(%edx)
  41048e: d9 c9          fxch   %st(1)
  410490: d8 c8          fmul   %st(0),%st
  410492: d9 c9          fxch   %st(1)
  410494: d8 c8          fmul   %st(0),%st
  410496: de c1          faddp  %st,%st(1)
  410498: d9 42 08       flds   0x8(%edx)
  41049b: d8 c8          fmul   %st(0),%st
  41049d: de c1          faddp  %st,%st(1)
  41049f: d9 c0          fld    %st(0)
  4104a1: d9 fa          fsqrt
  4104a3: dd e0          fucom  %st(0)
  4104a5: df e0          fnstsw %ax
  4104a7: 9e             sahf
  4104a8: 7a 06          jp     4104b0 <float dist<3>(FloatArray<3> const&)+0x30>
  4104aa: 75 04          jne    4104b0 <float dist<3>(FloatArray<3> const&)+0x30>
  4104ac: dd d9          fstp   %st(1)
  4104ae: c9             leave
  4104af: c3             ret
  4104b0: dd d8          fstp   %st(0)
  4104b2: d9 1c 24       fstps  (%esp)
  4104b5: e8 06 de ff ff call   40e2c0 <sqrtf>
  4104ba: c9             leave
  4104bb: c3             ret
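The recursive-template unrolling above can be exercised with a small self-contained sketch (the code is reformatted from the post; the test value is illustrative):

```cpp
#include <cmath>

template<int N>
struct FloatArray { float data[N]; };

// Compile-time recursion: Distance<K>::dist2 sums the squares of
// elements 0..K, and the Distance<0> specialization ends the recursion.
// With inlining at -O2, the whole chain collapses into straight-line code.
template<int K>
struct Distance
{
    template<int N>
    static inline float dist2(const FloatArray<N>& array)
    {
        return Distance<K - 1>::dist2(array) + array.data[K] * array.data[K];
    }
};

template<>
struct Distance<0>
{
    template<int N>
    static inline float dist2(const FloatArray<N>& array)
    {
        return array.data[0] * array.data[0];
    }
};

template<int N>
float dist(const FloatArray<N>& array)
{
    return std::sqrt(Distance<N - 1>::dist2(array));
}
```

Calling `dist` on a `FloatArray<3>` initialized to `{3, 4, 12}` should yield 13; compiling with `g++ -O2 -S` and inspecting the output shows the loop-free code in the dump above.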
OK, first of all, thank you everyone for replying. AP, I don't understand: is it not possible to optimize template code? I have read articles (well, I admit I read just the headlines) saying that templates are used to speed things up, as in the blitz++ library. If you think it's not worth it, I'll back off.
iMalc, I admit this is the first time I have tried benchmarking something as precise as math functions. I am calling the high-performance timer from glfw, and I go like this:
current_time = glfwGetTime();
for (int j = 0; j < 10000; ++j)
    res += call_length_func();
time_taken = glfwGetTime() - current_time;
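Two common pitfalls with a loop like this are that the compiler can hoist the call out of the loop when the inputs never change, or delete the loop entirely when the result is unused. A sketch of a harness that avoids both, using `std::chrono` instead of glfw (`call_length_func` here is a hypothetical stand-in for the function under test):

```cpp
#include <chrono>
#include <cmath>

// Hypothetical stand-in for the function being benchmarked.
inline float call_length_func(float x, float y, float z)
{
    return std::sqrt(x * x + y * y + z * z);
}

// Runs the same 10000-iteration loop as the post, but (a) varies the
// input each iteration so the whole call cannot be hoisted out of the
// loop, and (b) returns the accumulated result so the loop cannot be
// removed as dead code.
long long time_length_calls(float& res_out)
{
    volatile float seed = 1.0f;  // volatile read defeats constant folding

    float res = 0.0f;
    auto start = std::chrono::steady_clock::now();
    for (int j = 0; j < 10000; ++j)
        res += call_length_func(seed + j, seed, seed);
    auto elapsed = std::chrono::steady_clock::now() - start;

    res_out = res;
    return std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
}
```

From `main`, print `res_out` along with the elapsed time; using the result is what keeps the measured loop alive under -O2/-O3.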
Anyhow, I took note of what you all said. I passed a struct of floats to length_ord, and got hold of the disassembly for both functions. I can more or less understand what it is doing for length_ord, but not for the template struct function (it looks to me like it is jumping into itself, for all my knowledge is worth):
Dump of assembler code for function _Z10length_ordRK11floatstruct:
0x08049bfc <_Z10length_ordRK11floatstruct+0>:  push   %ebp
0x08049bfd <_Z10length_ordRK11floatstruct+1>:  mov    %esp,%ebp
0x08049bff <_Z10length_ordRK11floatstruct+3>:  mov    0x8(%ebp),%eax
0x08049c02 <_Z10length_ordRK11floatstruct+6>:  flds   (%eax)
0x08049c04 <_Z10length_ordRK11floatstruct+8>:  flds   0x4(%eax)
0x08049c07 <_Z10length_ordRK11floatstruct+11>: flds   0x8(%eax)
0x08049c0a <_Z10length_ordRK11floatstruct+14>: fxch   %st(2)
0x08049c0c <_Z10length_ordRK11floatstruct+16>: fmul   %st(0),%st
0x08049c0e <_Z10length_ordRK11floatstruct+18>: fxch   %st(1)
0x08049c10 <_Z10length_ordRK11floatstruct+20>: fmul   %st(0),%st
0x08049c12 <_Z10length_ordRK11floatstruct+22>: faddp  %st,%st(1)
0x08049c14 <_Z10length_ordRK11floatstruct+24>: fxch   %st(1)
0x08049c16 <_Z10length_ordRK11floatstruct+26>: fmul   %st(0),%st
0x08049c18 <_Z10length_ordRK11floatstruct+28>: faddp  %st,%st(1)
0x08049c1a <_Z10length_ordRK11floatstruct+30>: fsqrt
0x08049c1c <_Z10length_ordRK11floatstruct+32>: pop    %ebp
0x08049c1d <_Z10length_ordRK11floatstruct+33>: ret
Dump of assembler code for function _ZN11vector_baseIfLj3EE6lengthEv:
0x08049d88 <_ZN11vector_baseIfLj3EE6lengthEv+0>:  push   %ebp
0x08049d89 <_ZN11vector_baseIfLj3EE6lengthEv+1>:  mov    %esp,%ebp
0x08049d8b <_ZN11vector_baseIfLj3EE6lengthEv+3>:  sub    $0x8,%esp
0x08049d8e <_ZN11vector_baseIfLj3EE6lengthEv+6>:  mov    0x8(%ebp),%eax
0x08049d91 <_ZN11vector_baseIfLj3EE6lengthEv+9>:  fldz
0x08049d93 <_ZN11vector_baseIfLj3EE6lengthEv+11>: mov    $0x3,%edx
0x08049d98 <_ZN11vector_baseIfLj3EE6lengthEv+16>: flds   (%eax)
0x08049d9a <_ZN11vector_baseIfLj3EE6lengthEv+18>: fmul   %st(0),%st
0x08049d9c <_ZN11vector_baseIfLj3EE6lengthEv+20>: faddp  %st,%st(1)
0x08049d9e <_ZN11vector_baseIfLj3EE6lengthEv+22>: add    $0x4,%eax
0x08049da1 <_ZN11vector_baseIfLj3EE6lengthEv+25>: dec    %edx
0x08049da2 <_ZN11vector_baseIfLj3EE6lengthEv+26>: jne    0x8049d98 <_ZN11vector_baseIfLj3EE6lengthEv+16>
0x08049da4 <_ZN11vector_baseIfLj3EE6lengthEv+28>: fld    %st(0)
0x08049da6 <_ZN11vector_baseIfLj3EE6lengthEv+30>: fsqrt
0x08049da8 <_ZN11vector_baseIfLj3EE6lengthEv+32>: fucom  %st(0)
0x08049daa <_ZN11vector_baseIfLj3EE6lengthEv+34>: fnstsw %ax
0x08049dac <_ZN11vector_baseIfLj3EE6lengthEv+36>: and    $0x45,%ah
0x08049daf <_ZN11vector_baseIfLj3EE6lengthEv+39>: cmp    $0x40,%ah
0x08049db2 <_ZN11vector_baseIfLj3EE6lengthEv+42>: je     0x8049dc6 <_ZN11vector_baseIfLj3EE6lengthEv+62>
0x08049db4 <_ZN11vector_baseIfLj3EE6lengthEv+44>: fstp   %st(0)
0x08049db6 <_ZN11vector_baseIfLj3EE6lengthEv+46>: sub    $0x10,%esp
0x08049db9 <_ZN11vector_baseIfLj3EE6lengthEv+49>: fstps  (%esp)
0x08049dbc <_ZN11vector_baseIfLj3EE6lengthEv+52>: call   0x8049898 <_init+920>
0x08049dc1 <_ZN11vector_baseIfLj3EE6lengthEv+57>: add    $0x10,%esp
0x08049dc4 <_ZN11vector_baseIfLj3EE6lengthEv+60>: jmp    0x8049dc8 <_ZN11vector_baseIfLj3EE6lengthEv+64>
0x08049dc6 <_ZN11vector_baseIfLj3EE6lengthEv+62>: fstp   %st(1)
0x08049dc8 <_ZN11vector_baseIfLj3EE6lengthEv+64>: leave
0x08049dc9 <_ZN11vector_baseIfLj3EE6lengthEv+65>: ret
Look, I was just wondering if there was something terribly wrong in my templated function code. I am not trying to optimize too much, just as much as is possible on the go.
Although this link is a little out of date (made for VS6), I do recommend it for learning how to get the most out of optimising vector classes.
With newer versions of VS, the loop unrolling construct of that doc becomes unnecessary.
I'm not sure how it fares with GCC though.
Quote:Original post by iMalc
Although this link is a little out of date (made for VS6), I do recommend it for learning how to get the most out of optimising vector classes.
With newer versions of VS, the loop unrolling construct of that doc becomes unnecessary.
I'm not sure how it fares with GCC though.
iMalc, actually you were right, the way I was testing wasn't fair. I dug out an old vector class (without templates) and the new code performed at par (it actually worked faster for some cases like dot product). So I'm happy with it for now.
Thanks for the link!
Quote: Original post by deavik
Quote: Original post by iMalc
Although this link is a little out of date (made for VS6), I do recommend it for learning how to get the most out of optimising vector classes.
With newer versions of VS, the loop unrolling construct of that doc becomes unnecessary.
I'm not sure how it fares with GCC though.
iMalc, actually you were right, the way I was testing wasn't fair. I dug out an old vector class (without templates) and the new code performed at par (it actually worked faster for some cases like dot product). So I'm happy with it for now.

I thought it might. Good to hear!
This topic is closed to new replies.