
for loops and overhead?


hi, i've programmed a vector template class so that i can reuse my code for 2D, 3D or 4D vectors of different data types. therefore, most of the methods must use a for loop like this:
  
template <unsigned int N, class T = float>
float CVector<N, T>::Dot(const CVector &pVec) const
{
   float result = 0;
   for (unsigned int i = 0; i < N; i++) 
      result += v[i] * pVec.v[i];
   return result;
}
  
my question is: do for loops have any significant overheads / performance drawbacks? thanks Gammastrahler

For loops are compiled into conditional jumps, which are very fast. I see nothing unnecessary in your function, and every operation you are doing will be fast.

You're getting the benefits of reusability via templates while writing lean and mean code. That's why I like writing good C code with C++'s encapsulation.
*sniff* And that's what game programming should be all about.

Oh, and for further speed questions I advise you to learn how to use your compiler/IDE's profiler. It's a really powerful tool that tells you what areas you need to redesign/optimize, instead of searching line by line for redundancy.

I've always lived by the philosophy: write everything lean and clean, and write your graphics/networking code fast and mean.

- Kevin "BaShildy" King
Game Programmer: DigiPen
www.mpogd.com

In many cases a for loop is slower than avoiding it altogether. Although it contains only a conditional branch, even a branch machine code instruction still takes time.

However, some compilers support an optimization called "loop unrolling". The compiler will attempt to detect whether it can know how many times the loop will execute, and if the number is small it will eliminate the loop completely by replacing it with the contents directly. If N was 3, your code would be replaced by the compiler with:

result += v[0] * pVec.v[0];
result += v[1] * pVec.v[1];
result += v[2] * pVec.v[2];

Whether this happens in your case, you can check by looking at the resulting assembler code.

Some compilers (such as Intel C++ or Codeplay VectorC) detect such constructions and can compile them to MMX/SSE/3DNow instructions.

Also, if you want to help your compiler produce faster code (whether it actually helps depends on the compiler), make sure it tests against zero instead of N, like this:

for (unsigned int i = N-1; i >= 0; --i)

BTW, I think there is a bug in your program: result shouldn't be declared as float but as T, and the same goes for the return type.

[edited by - felonius on June 1, 2002 7:23:58 AM]

Guest Anonymous Poster
quote:

Also, if you want to help your compiler to produce faster code (whether it actually helps, depends on the compiler) you make sure it tests against zero instead of N like this:



Speaking from an x86 point of view, why would this be faster? You have to get the flags set somehow to jump off them. The traditional way you would do:

cmp ebx, ecx
jge
...

And for the one you outlined you would do perhaps:

add ebx, 0
jl
...

Why would the latter be any faster than the former?

If you would like the compiler to unroll the loop for you, look into template metaprogramming. Of course, if you are worried about performance, you might want to inline it also.

Guest Anonymous Poster
"conditional jumps which are very fast"

Aren't they actually among the slowest of instructions? After all, a mispredict creates a bubble in the pipeline. Loops that are only 2 to 4 iterations long are the worst, too. A two-iteration loop would be mispredicted half the time, which would be devastating, right?

Guest Anonymous Poster
In this specific case, where N will probably be either 2, 3, or 4, isn't it faster to unroll the loops?

Also, in a function like "Dot" that will be called very often, wouldn't it be better to create "Result" once as some dummy class member?


Two ways to avoid loops in C++:
Template Specialization:

  
template <unsigned int N>
inline unsigned int Factorial(void) {
    unsigned int result = 1;
    for (unsigned int a = N; a > 1; --a)
        result *= a;   // *= rather than the original += so this actually computes a factorial
    return result;
}

// A specialization for if the factorial of 3 is needed

template <>
inline unsigned int Factorial<3>(void) {
    return 6;
}

Template Metaprogramming:

  
template <unsigned int N>
struct Factorial {
    static inline unsigned int Get(void) {
        return Factorial<N-1>::Get()+N;
    }
};

template <>
struct Factorial<1> {
    static inline unsigned int Get(void) {
        return 1;
    }
};

template <>
struct Factorial<0> {
    static inline unsigned int Get(void) {
        return 0;
    }
};

I once wrote a bubble sort using metaprogramming; it's a lot of fun to play with. I'm sure you can figure out how to add the type template parameter back into that (if your compiler happens to support partial specialization, that is).

>> Why would the latter be any faster than the former?

I don't know about the x86 architecture, but on many RISC machines the latter is faster because you need one less instruction. On, for instance, the Digital Alpha, a jump on the comparison x<3 is this (t0 is x):

ldiq t1, 3 // This is only needed in the first iteration.
...
...
cmplt t0, t1, t2
bne t2, mylabel

while the x>=0 can be expressed just as

bge t0, mylabel

and this is clearly faster. Since the x86 internally also has to subtract the two values before checking the sign, I guess it will be slower there too. Remember, on the x86 some instructions are slower than others, and comparing two integers is more complex than comparing a single number against zero (where only the sign bit needs to be tested).

Guest Anonymous Poster
quote:

return Factorial<N-1>::Get()+N;



I think you meant

return Factorial<N-1>::Get()*N;

but anyways. So will the compiler essentially pre-compute the factorial function so that it is not done at run-time if given constant input? What happens with


  
int x;

cin >> x;
cout << Factorial<x>::Get() << endl;
cout << Factorial<7>::Get() << endl;


I don't see how the first call could be expanded, but the second could.

which do you think would be faster:

the many loops of the x86

dec ecx
jnz topOfLoop

// maybe?

inc ecx
cmp ebx, ecx
jl topOfLoop // we want the same results as the previous


// or if the loop is small enough
loop topOfLoop

Yep, there is a single instruction that decs ecx AND handles the conditional jmp; it's the likely code the compiler would use (I have not checked; ALWAYS check what your compiler produces in release/optimize mode).

With the first two methods you need 2 registers for checking the loop condition, which is very bad if your code requires a lot of registers. One register can make all the difference, since you may need to resort to extra push/pop pairs or to moving data to allocated memory via mov. Many asm coders forget about the loop instruction (useful only for small loops, since it can only do short jmps).

quote:
Original post by felonius
Also, if you want to help your compiler to produce faster code (whether it actually helps, depends on the compiler) you make sure it tests against zero instead of N like this:

for (unsigned int i = N-1; i >= 0; --i)



...will not work unless you change the counter to be a signed type.

[edited by - mcfly on June 2, 2002 3:44:50 PM]

>> ...will not work unless you change the counter to be a signed type.

Yeah, I overlooked that; he used an unsigned int in the original example.

quote:
Original post by Anonymous Poster
I think you meant
return Factorial<N-1>::Get()*N;

Yeah.
quote:
Original post by Anonymous Poster
I don't see how the first call could be expanded, but the second could.

No, the first example couldn't be expanded. It's a poor method of optimizing for a factorial function; it's much better for things like square root, where you might call it with '2' and '3' often (you could return constants for those, but run it through 'sqrt' if there's no constant stored). The second method is much more adaptable, but more demanding on the compiler, and overall better.

Guest Anonymous Poster
Am I correct in noting that template metaprogramming is only useful when compile-time constants are available? By compile-time constants I mean the dimensions of vectors and matrices or the parameters of a mathematical expression. I've found a couple of links that go so far as to represent all of the basic control structures (if, while, for, switch, etc.) in a template meta-language, but aside from that, is my statement correct?

quote:
Original post by Anonymous Poster
Am I correct in noting that template metaprogramming is only useful when compile-time constants are available? By compile-time constants I mean the dimensions of vectors and matrices or the parameters of a mathematical expression. I've found a couple of links that go so far as to represent all of the basic control structures (if, while, for, switch, etc.) in a template meta-language, but aside from that, is my statement correct?

You're probably correct for most general situations. I wrote a quicksort as a template metaprogram earlier today, but it took an insane amount of time to compile (for more than 10 items in an array it would take minutes and many megs of RAM (over 100 sometimes), with optimizations turned _off_ in GCC), bloated the executable to an insane size, and was much, much slower than a normal recursive quicksort routine (it kills any chance the CPU has at caching, and probably had more function calls that the compiler didn't manage to work out, because optimizations couldn't be turned on without me running out of RAM during compilation). So, while template metaprogramming has its place, it certainly isn't always a practical answer. If you actually want the source to the quicksort, I'll post it.

Branches can usually be eliminated (including any loop), and while they carry a little overhead, it's usually small enough to ignore. On 586+'s a mispredicted branch stalls the pipeline (prior CPUs didn't pipeline). In the example code above, I wouldn't be surprised if the loop effectively takes nearly as much time as the math. By unrolling the loop, you take better advantage of the floating point unit's pipelining capabilities and can completely avoid the branch misprediction.
On a 486 or earlier you could eliminate the integer add; on the 586s I think it'll happen in parallel with the floating-point operations, so that's not such a big issue. I'm not sure how the P4 architecture handles this situation.

When branching becomes really bad is when it causes a cache miss, and it becomes catastrophic if it causes a page fault. Odds are low that a loop will cause either of those.


Magmai Kai Holmlor

"Oh, like you've never written buggy code" - Lee

[Look for information | GDNet Start Here | GDNet Search Tool | GDNet FAQ | MSDN RTF[L] | SGI STL Docs | STFW | Asking Smart Questions ]

[Free C++ Libraries | Boost | ACE | Loki | MTL | Blitz++ | wxWindows]

quote:
Original post by Null and Void
If you actually want the source to the quicksort, I'll post it.


I, for one, would love to see it. I am interested in template meta-programming, but I don't know how to do anything with it yet. Seeing a template meta-programmed quicksort would help greatly =)

p.s. maybe you could post it in a new thread so everyone can see it.

[edited by - sjelkjd on June 3, 2002 3:27:55 PM]

Writing a quick-sort as a meta-template is serious compiler abuse - good thing it's not a small fuzzy animal.

I imagine it looks something like...

  
template<typename T, int start, int end>
struct QuickSortFromHell
{
    void operator()(T* array)
    {
        if(start == end) return;
        QuickSortFromHell<T, start, end/2>()(array);
        QuickSortFromHell<T, end/2, end>()(array);
        //sort here

        int pivot = ...
    }
};

Efficient it's not...

Here's my quicksort meta template. It's the first 'more than a bubble sort' sorting algorithm I've done as a meta template, so it probably could have been done more elegantly. I based it directly off a recursive quicksort I found online so that I could later compare them. Now, let's pray the source tag doesn't mangle it too badly.

    
template <class Type, int LHold, int RHold, int Left, int Right, bool LlessR> struct InnerQSort_0;

template <class Type, int LHold, int RHold, int Left, int Right, bool LnotR>
struct InnerQSort_4 {
    static inline void Go(Type Data[], Type &Pivot) {
        Data[Right] = Data[Left];
        InnerQSort_0<Type,LHold,RHold,Left,Right-1, (Left<(Right-1)) >::Go(Data,Pivot);
    }
};
template <class Type, int LHold, int RHold, int Left, int Right>
struct InnerQSort_4 <Type, LHold, RHold, Left, Right, false> {
    static inline void Go(Type Data[], Type &Pivot) {
        InnerQSort_0<Type,LHold,RHold,Left,Right, (Left<Right) >::Go(Data,Pivot);
    }
};

template <class Type, int LHold, int RHold, int Left, int Right, bool LlessR>
struct InnerQSort_3 {
    static inline void Go(Type Data[], Type &Pivot) {
        if(Data[Left] <= Pivot)
            InnerQSort_3<Type,LHold,RHold,Left+1,Right, ((Left+1)<Right) >::Go(Data,Pivot);
        else
            InnerQSort_4<Type,LHold,RHold,Left,Right, (Left!=Right) >::Go(Data,Pivot);
    }
};
template <class Type, int LHold, int RHold, int Left, int Right>
struct InnerQSort_3 <Type, LHold, RHold, Left, Right, false> {
    static inline void Go(Type Data[], Type &Pivot) {
        InnerQSort_4<Type,LHold,RHold,Left,Right, (Left!=Right) >::Go(Data, Pivot);
    }
};

template <class Type, int LHold, int RHold, int Left, int Right, bool LnotR>
struct InnerQSort_2 {
    static inline void Go(Type Data[], Type &Pivot) {
        Data[Left] = Data[Right];
        InnerQSort_3<Type,LHold,RHold,Left+1,Right, ((Left+1)<Right) >::Go(Data,Pivot);
    }
};
template <class Type, int LHold, int RHold, int Left, int Right>
struct InnerQSort_2 <Type, LHold, RHold, Left, Right, false> {
    static inline void Go(Type Data[], Type &Pivot) {
        InnerQSort_3<Type,LHold,RHold,Left,Right, (Left<Right) >::Go(Data,Pivot);
    }
};

template <class Type, int LHold, int RHold, int Left, int Right, bool LlessR>
struct InnerQSort_1 {
    static inline void Go(Type Data[], Type &Pivot) {
        if(Data[Right] >= Pivot)
            InnerQSort_1<Type,LHold,RHold,Left,Right-1, (Left<(Right-1)) >::Go(Data,Pivot);
        else
            InnerQSort_2<Type,LHold,RHold,Left,Right, (Left!=Right) >::Go(Data, Pivot);
    }
};
template <class Type, int LHold, int RHold, int Left, int Right>
struct InnerQSort_1 <Type, LHold, RHold, Left, Right, false> {
    static inline void Go(Type Data[], Type &Pivot) {
        InnerQSort_2<Type,LHold,RHold,Left,Right, (Left!=Right) >::Go(Data, Pivot);
    }
};

template <class Type, int LHold, int RHold, int Left, int Right, bool LlessR>
struct InnerQSort_0 {
    static inline void Go(Type Data[], Type &Pivot) {
        InnerQSort_1<Type,LHold,RHold,Left,Right, (Left<Right) >::Go(Data,Pivot);
    }
};

template <class Type, int Left, int Right, bool LlessR> struct InnerQSort;

template <class Type, int LHold, int RHold, int Left, int Right>
struct InnerQSort_0 <Type, LHold, RHold, Left, Right, false> {
    static inline void Go(Type Data[], Type &Pivot) {
        Data[Left] = Pivot;

        if(LHold < Left)
            InnerQSort<Type, LHold, Left-1, (LHold<(Left-1)) >::Sort(Data);
        if(RHold > Left)
            InnerQSort<Type, Left+1, RHold, ((Left+1)<RHold) >::Sort(Data);
    }
};

template <class Type, int Left, int Right, bool LlessR>
struct InnerQSort {
    static inline void Sort(Type Data[]) {
        Type Pivot = Data[Left];

        InnerQSort_0<Type,Left,Right,Left,Right, (Left<Right) >::Go(Data,Pivot);
    }
};

template <class Type, int Left, int Right>
struct InnerQSort <Type, Left, Right, false> {
    static inline void Sort(Type Data[]) {
    }
};

template <int Size, class Type>
struct QuickSort {
    static inline void Sort(Type Data[]) {
        InnerQSort<Type,0,Size-1,(0 < (Size-1))>::Sort(Data);
    }
};

I know that it compiles in GCC (G++) 2.95.4 and up. But heed that warning I gave earlier: if you add more than 10 items to be sorted it takes a LOT of memory to compile. You may also need to add something like this to the command line: -ftemplate-depth-2000 (I think that's the correct switch, I can't remember at the moment).



[edited by - Null and Void on June 3, 2002 9:11:00 PM]

Guest Anonymous Poster
Say I wanted to have a general Vector class that is templated on the data type stored and the space the vector resides in. If I want to specialize for float/2, float/3, float/4, can I provide an unspecialized interface and, in the implementation, provide specialized method implementations? An example of what I mean is below.


  
#ifndef VECTOR_H
#define VECTOR_H

template <typename T, unsigned int D>
class Vector {
public:
    Vector();
    Vector(const Vector& v);
    // ...

    T dot(const Vector& v);

private:
    T pts[D];
};

template <typename T, unsigned int D>
Vector<T, D>::Vector()
{
    for (unsigned int i = 0; i < D; ++i)
        pts[i] = 0.0;
}

template <>
Vector<float, 2>::Vector()
{
    pts[0] = pts[1] = 0.0;
}

template <typename T, unsigned int D>
Vector<T, D>::Vector(const Vector& v)
{
    for (unsigned int i = 0; i < D; ++i)
        pts[i] = v.pts[i];
}

template <>
Vector<float, 2>::Vector(const Vector& v)
{
    pts[0] = v.pts[0];
    pts[1] = v.pts[1];
}

template <typename T, unsigned int D>
T Vector<T, D>::dot(const Vector& v)
{
    T product = 0.0;
    for (unsigned int i = 0; i < D; ++i)
        product += pts[i]*v.pts[i];
    return product;
}

template <>
float Vector<float, 2>::dot(const Vector& v) {
    return pts[0]*v.pts[0]+pts[1]*v.pts[1];
}

#endif // VECTOR_H (this was missing above; it closes the include guard)



Or do I have to provide a fully specialized implementation for float,2? Also, what would a non-member (friend) version of dot() look like?

According to the spec, you can partially specialize the template for the second parameter, as 2, 3, 4, ..., and leave the first one a template parameter.

According to Microsloth, you can't (yet). Real Soon Now™ you should be able to, when they release the next revision of the VS.Net compiler.

If you're compiling with gcc you can use partial specialization:

  
template<typename T>
Vector<T, 2>::Vector()
{}


And I think dot would look like this

  
template<typename T>
T dot(const Vector<T, 2>& lhs, const Vector<T, 2>& rhs)
{}

but don't quote me...

Have a look at the MTL (in sig); they have matrix & vector classes & algorithms akin to the STL. And it's free.

Magmai Kai Holmlor


While we're on the subject of abusing compilers with template metaprogramming, I might as well contribute a neat trick for data copying. It should only be used on POD types, however, since nothing more than a raw memory copy will be performed.


  
template<class T> void copy(T& a, const T& b)
{
#pragma pack(push,1)

    struct POD
    {
        char data[sizeof(T)];
    };

#pragma pack(pop)

    *reinterpret_cast<POD*>(&a) = *reinterpret_cast<const POD*>(&b);
}
