3D Vector class using template expressions : need help !

13 comments, last by Bakura 16 years, 10 months ago
Hi everyone! I read the excellent "C++ Templates - The Complete Guide", and its chapter on expression templates sounded really promising for performance. Because it was extremely interesting, I decided to write my own 3D vector class by imitating their "Array" class and adding some functions. Their implementation may not be the best: it looks really complicated, and I'm sure there are simpler ways to do it. I'll try those later, but first I'd like to understand what I'm doing and run some performance tests.

So far it works really well. Compared to a "naive" class template, mine can be more than 10 times faster, but only when I do a lot of operations (let's say 1 million A = B + C, for example)... And only on GCC (my version is 3.1.2 or something like that). With Visual C++ Express 2005, performance is REALLY REALLY bad! This is how I take the measurements:
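#include <windows.h>   // LARGE_INTEGER, QueryPerformanceCounter, QueryPerformanceFrequency
#include <iostream>

struct Trial
{
   Trial () : T (0.0), temp (0.0) {}
   double T;      // accumulated time, in seconds
   double temp;   // start time, then the duration of the last Start()/End() interval

   // Current value of the high-resolution counter, converted to seconds
   double GetTime ()
   {
      LARGE_INTEGER Clock;
      LARGE_INTEGER ClockFreq;
      QueryPerformanceCounter  ( &Clock );
      QueryPerformanceFrequency( &ClockFreq );
      return (Clock.QuadPart) / (double)(ClockFreq.QuadPart);
   }

   void Start ()
   {
      temp = GetTime();
   }

   void End ()
   {
      temp = GetTime() - temp;
      T += temp;
   }
};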

int main()
{
   {
      Trial maStructure;
      maStructure.Start ();
      Vec3 <double> A (1.0, 1.0, 1.0);
      Vec3 <double> B (2.0, 2.0, 2.0);
      Vec3 <double> C (3.0, 3.0, 3.0);

      for (size_t i = 0 ; i != 10000000 ; ++i)
         A = B+C+A;

      maStructure.End ();
      std::cout << "Michaël's implementation: " << maStructure.T;
   }
}
Maybe it's not the best way to measure, but with this many operations I could almost time it with a stopwatch! And with Visual C++ Express 2005, the normal way of writing a vector class is a lot faster than mine.

for (size_t i = 0 ; i != 10000000 ; ++i) A = B+C+A;

With Code::Blocks (GCC 3.1.2): 0.0710791 s (it appears immediately).
With Visual: 0.353937 s (I have to wait a little before it appears).

Everything is compiled in release mode, with all optimizations enabled in both compilers... I really don't know why it doesn't work well, because expression templates seem really interesting and I'd really like to use them later in games; maybe they could boost performance!

Here is the code: http://membres.lycos.fr/hl2connection/VectorClass.7z

I know this kind of code is really hard to read, but if someone could take a look or give me some advice... PS: I can't read ASM, it's like Chinese to me, so please don't answer "look at the ASM code :nerd:". Thanks!
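For readers who cannot open the archive, the general idea behind an expression-template vector can be sketched roughly as follows. This is only a minimal illustration and not the posted code: Vec3d, AddExpr and demo_sum are hypothetical names, only addition is supported, and the element type is fixed to double.

```cpp
#include <cstddef>

// Hypothetical names for illustration only; this is not the code from the archive.

// Node for "lhs + rhs". It only remembers its operands; a component is
// computed when operator[] is called, so no temporary vector is built.
template <typename L, typename R>
struct AddExpr {
    const L& lhs;
    const R& rhs;
    AddExpr(const L& l, const R& r) : lhs(l), rhs(r) {}
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

struct Vec3d {
    double v[3];

    Vec3d(double x, double y, double z) { v[0] = x; v[1] = y; v[2] = z; }
    double operator[](std::size_t i) const { return v[i]; }

    // Assigning an expression evaluates it one component at a time.
    // Component i of the result depends only on component i of the operands,
    // so A = B + C + A is safe even though A appears on both sides.
    template <typename L, typename R>
    Vec3d& operator=(const AddExpr<L, R>& e) {
        for (std::size_t i = 0; i < 3; ++i) v[i] = e[i];
        return *this;
    }
};

// Vec3d + Vec3d builds a node instead of a new vector.
inline AddExpr<Vec3d, Vec3d> operator+(const Vec3d& a, const Vec3d& b) {
    return AddExpr<Vec3d, Vec3d>(a, b);
}

// (expression) + Vec3d nests the existing node inside a new one.
template <typename L, typename R>
AddExpr<AddExpr<L, R>, Vec3d> operator+(const AddExpr<L, R>& a, const Vec3d& b) {
    return AddExpr<AddExpr<L, R>, Vec3d>(a, b);
}

// Mirrors the loop body from the post: returns b + c + a.
inline Vec3d demo_sum(const Vec3d& a, const Vec3d& b, const Vec3d& c) {
    Vec3d result(0.0, 0.0, 0.0);
    result = b + c + a;
    return result;
}
```

Whether this beats a plain vector class depends heavily on how well the compiler inlines the nested operator[] calls, which is presumably where the two compilers differ.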
- Configuration Options
- C/C++
- Code Generation
- Floating Point Model: Fast (/fp:fast)

Your running time should now be identical to that of gcc. In my case it's even 10% faster.
Hej !

Thanks for your answer, but it's not really better. The result is about the same (GCC is still around 5 times faster than Visual).

EDIT :
- Configuration Options
- C/C++
- Advanced
- Calling convention ("Convention d'appel"; my install is in French :p): __fastcall instead of __cdecl. It's a little faster (around 0.25 instead of 0.35), but still far from GCC :/
What are the flags you're using for each compiler?

EDIT: if you're compiling with optimisations enabled, it's entirely possible that GCC is clever enough to optimise the loop out entirely because there are no observable side effects. I don't know how good it is at that kind of thing, though. You'd have to look at the asm to tell ;)
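One way to make that kind of loop removal less likely, without reading any asm, is to feed the loop values the compiler cannot know at compile time and to use the result afterwards. The sketch below is an assumption-laden illustration: V3 and run_loop are hypothetical stand-ins for the vector class and the benchmark loop, and a volatile read is a common trick rather than a guarantee.

```cpp
#include <cstddef>

// Hypothetical stand-in for the vector type under test.
struct V3 { double x, y, z; };

inline V3 operator+(const V3& a, const V3& b) {
    return V3{a.x + b.x, a.y + b.y, a.z + b.z};
}

// Runs A = B + C + A `iterations` times and returns A, so the caller can
// print it or otherwise make the result observable.
inline V3 run_loop(std::size_t iterations, double seed) {
    // Reading the seed through a volatile keeps the operands from being
    // treated as compile-time constants.
    volatile double s = seed;
    const double b = s;
    const double c = s + 1.0;

    V3 A{1.0, 1.0, 1.0};
    const V3 B{b, b, b};
    const V3 C{c, c, c};

    for (std::size_t i = 0; i != iterations; ++i)
        A = B + C + A;

    return A;
}
```

Timing a call such as run_loop(10000000, 2.0) and then printing the returned components gives the program a side effect that depends on the loop, so the work cannot simply be discarded.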
ASM is like Chinese to me: I can look at it, but I don't understand anything.

For Visual Express C++ 2005 :

/O2 /Ob1 /Oi /Ot /Oy /GT /GL /FD /EHsc /MT /Gy /arch:SSE /fp:fast /Fo"Release\\" /Fd"Release\vc80.pdb" /W2 /nologo /c /Gr /TP /errorReport:prompt

For GCC :

Optimize fully (O3), Expensive Optimizations (-fexpensive-optimizations)
and a CPU architecture tuning (-march=athlon-xp)
OK, just as a test I set the warning level to the maximum (4), and I got some strange warnings (I tried to translate them into English in parentheses):

warning C4512: 'A_Add<T,OP1,OP2>' : l'opérateur d'assignation n'a pas pu être généré ('A_Add<T,OP1,OP2>' : assignment operator could not be generated)
with
[
T=double,
OP1=SVec3<double>,
OP2=SVec3<double>
]

voir la référence à l'instanciation de la classe modèle 'A_Add<T,OP1,OP2>' en cours de compilation (see reference to class template instantiation 'A_Add<T,OP1,OP2>' being compiled)
with
[
T=double,
OP1=SVec3<double>,
OP2=SVec3<double>
]

voir la référence à l'instanciation de la classe modèle 'Vec3<T,Expr>' en cours de compilation (see reference to class template instantiation 'Vec3<T,Expr>' being compiled)
with
[
T=double,
Expr=A_Add<double,SVec3<double>,SVec3<double>>
]

warning C4512: 'A_Add<T,OP1,OP2>' : l'opérateur d'assignation n'a pas pu être généré ('A_Add<T,OP1,OP2>' : assignment operator could not be generated)
with
[
T=double,
OP1=A_Add<double,SVec3<double>,SVec3<double>>,
OP2=SVec3<double>
]

voir la référence à l'instanciation de la classe modèle 'A_Add<T,OP1,OP2>' en cours de compilation (see reference to class template instantiation 'A_Add<T,OP1,OP2>' being compiled)
with
[
T=double,
OP1=A_Add<double,SVec3<double>,SVec3<double>>,
OP2=SVec3<double>
]

voir la référence à l'instanciation de la classe modèle 'Vec3<T,Expr>' en cours de compilation (see reference to class template instantiation 'Vec3<T,Expr>' being compiled)
with
[
T=double,
Expr=A_Add<double,A_Add<double,SVec3<double>,SVec3<double>>,SVec3<double>>
]
I found a fix for the warnings: Visual just needed an operator= for the class A_Add. I added it, but it changes absolutely nothing for performance :/.

[Edited by - Bakura on June 11, 2007 3:13:44 PM]
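A likely explanation for C4512 here, assuming A_Add stores its operands as references (as expression-template nodes often do; the archive is not quoted in the thread), is that a class with reference members cannot have a compiler-generated assignment operator. The sketch below uses hypothetical types, Operand, NodeByRef and NodeByValue, and C++11 type traits to show the rule.

```cpp
#include <type_traits>

// Hypothetical operand type, for illustration only.
struct Operand { double v; };

// Holds its operands by reference. This is an assumption about how the
// poster's A_Add is written, not a copy of it.
struct NodeByRef {
    const Operand& a;
    const Operand& b;
    NodeByRef(const Operand& x, const Operand& y) : a(x), b(y) {}
};

// Same shape, but holding copies of the operands.
struct NodeByValue {
    Operand a;
    Operand b;
};

// A reference member can be copy-initialised, so copying the node still works...
static_assert(std::is_copy_constructible<NodeByRef>::value,
              "nodes with reference members can still be copied");

// ...but a reference cannot be reseated, so no assignment operator is generated.
static_assert(!std::is_copy_assignable<NodeByRef>::value,
              "nodes with reference members cannot be assigned");

static_assert(std::is_copy_assignable<NodeByValue>::value,
              "nodes holding values can be assigned");
```

Since expression nodes are normally created, read once and thrown away, the missing assignment operator should not matter at run time, which fits the observation that silencing the warning did not change the timings.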
Would it be possible to get your source code, in a format that will compile "stand-alone"? It's impossible to verify your benchmarks currently.

You might want to try MSVC with /Ox and /Og. I believe you'll also need to pass /LTCG to the linker, since you give /GL to the compiler.
The following should give much more accurate time values:
struct Trial
{
    LARGE_INTEGER StartTime;
    LARGE_INTEGER ClockFreq;
    double T;   // accumulated time, in seconds

    Trial(): T(0.0) {}

    void Start()
    {
        QueryPerformanceFrequency(&ClockFreq);
        QueryPerformanceCounter(&StartTime);
    }

    // Adds the time elapsed since Start() to T and returns the accumulated total
    double End()
    {
        LARGE_INTEGER EndTime;
        QueryPerformanceCounter(&EndTime);
        T += double(EndTime.QuadPart - StartTime.QuadPart) / double(ClockFreq.QuadPart);
        return T;
    }
};
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
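On a compiler with C++11 support (which Visual C++ 2005 does not have), the same Start()/End() pattern can be written without the Windows API. The following is a minimal sketch, assuming std::chrono::steady_clock offers enough resolution on the target platform; Stopwatch and Total are hypothetical names.

```cpp
#include <chrono>

// Accumulating stopwatch with the same Start()/End() usage as the Trial struct.
class Stopwatch {
public:
    void Start() { start_ = std::chrono::steady_clock::now(); }

    // Returns the seconds elapsed since Start() and adds them to the total.
    double End() {
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        total_ += elapsed.count();
        return elapsed.count();
    }

    // Sum of all measured intervals, in seconds.
    double Total() const { return total_; }

private:
    std::chrono::steady_clock::time_point start_{};
    double total_ = 0.0;
};
```

steady_clock is used because it never jumps backwards, which is the property that matters when measuring intervals.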
Your benchmark is horrible; I wouldn't be surprised if gcc is just removing the loop entirely. Something like the code below is a much better start, but it could still do with some improvement. (Of course, this type of benchmarking is useless: you should use real-world code for tests like this rather than trivial things, since this is more about how well the compiler's optimizer works in this particular situation than anything else.)

#include <cstdlib>    // rand, srand
#include <ctime>      // time
#include <iostream>

int main()
{
    Timer timer;

    // Relatively small array to ensure it all fits in the cache, you may need
    // to adjust this depending on your computer
    static const int size = 100;
    Vec3<double> A[size];
    Vec3<double> B[size];
    Vec3<double> C[size];

    srand(time(0));
    for (int i = 0; i < size; ++i)
    {
        // Initialized with random data to try to prevent the compiler
        // optimizing things away
        A[i] = Vec3<double>((float)rand(), (float)rand(), (float)rand());
        B[i] = Vec3<double>((float)rand(), (float)rand(), (float)rand());
        C[i] = Vec3<double>((float)rand(), (float)rand(), (float)rand());
    }

    timer.Start ();
    for (size_t i = 0; i != 10000000; ++i)
    {
        // Select random values from the array to prevent the compiler
        // performing any tricks based on how we loop over the values
        // since we will be adding each thing numerous times
        // (you may want to vary this to reflect how you would iterate
        // over an array in a real application)
        int index = rand() % size;
        A[index] = A[index] + B[index] + C[index] + A[index];
        B[index] = A[index] + B[index] + C[index] + B[index];
        C[index] = A[index] + B[index] + C[index] + C[index];
    }
    timer.End();

    // Print the values so the compiler can't just skip the calculation entirely
    for (int i = 0; i < size; ++i)
    {
        std::cout << A[i] << B[i] << C[i] << '\n';
    }
    std::cout << "Time: " << timer.T;
}

This topic is closed to new replies.
