templates and loop unrolling

Started by
13 comments, last by gran sveti 16 years, 4 months ago
Compare these two functions: A:
template<int size>
uint32 add(uint32* out, const uint32* a, const uint32* b) //return value is carry bit, out may be the same as a or b
{
  uint32 carry = 0U;
  for(int i = 0; i < size; i++)
  {
    out = a + b + carry;
    if(!carry && out < b) carry = 1U; //if it has overflown
    else if(carry && out <= b) carry = 1U;
    else carry = 0U;
  }
  return carry;
}


B:
uint32 add(uint32* out, const uint32* a, const uint32* b, int size) //return value is carry bit, out may be the same as a or b
{
  uint32 carry = 0U;
  for(int i = 0; i < size; i++)
  {
    out = a + b + carry;
    if(!carry && out < b) carry = 1U; //if it has overflown
    else if(carry && out <= b) carry = 1U;
    else carry = 0U;
  }
  return carry;
}


In the first version, will the compiler be able to compile things more efficiently (faster executable) because of the fact that size is a compile time value instead of a runtime value? And say you'd have an if statement that depends only on a template value "size", will that if statement be done at compile time (= only the "if" or "else" part of it are compiled depending on the value), or still at runtime (where it in fact always does the same...)?
Advertisement
For the templated version the compiler will generate ASM code for each different value of size. For small values I think he will be able to fully unroll the for loop so it could possibly get faster. On the other hand you now have larger code which could theoretically be bad for the instruction cache but PCs, which nowadays have large caches, this probably is irrelevant.
A smart compiler could produce identical code to the templated version if it detects that the function is beeing called with a compile time constant for the parameter size. I'm not sure if there is any compiler which will do this.

To be really sure about this you would have to profile this and/or look at the generated ASM. I suggest you use the RDTSC instruction for this.

If you have an "if" statement which depends on a compile time constant the compiler will only compile the branch that will actually get executed and eliminate the "if" statement completly. This is also true for a switch statement which will get eliminated completely if it only depends on a compile time constant.

[Edited by - Trenki on November 25, 2007 9:05:17 AM]
Probably, yes. It depends a bit on how many different sizes it's going to be called with (as said above, the code will be compiled once for each size, so it *may* lead to larger code). And loop unrolling isn't always going to be faster, so there may not be a huge performance gain from doing it. However, if size is a compile-time constant, then I'd prefer version A. Visual Studio at least is pretty good at avoiding code bloat with templates, and it does open up some nice possibilities.

Are you looking to optimize the code in general, or just asking about this specific question?
Quote:Original post by SpoonbenderAre you looking to optimize the code in general, or just asking about this specific question?


Sort of general, there are more functions like the above. I was just wondering if it was worth it to make them templates like that (with the disadvantage that the implementation is in the header).

From the answers, I find that it's worth it indeed, I'm really glad that the compilers do this :).

Are you suggesting that you see more things in the code I posted that can be easily optimized?
Quote:Original post by Lode
Quote:Original post by SpoonbenderAre you looking to optimize the code in general, or just asking about this specific question?


Sort of general, there are more functions like the above. I was just wondering if it was worth it to make them templates like that (with the disadvantage that the implementation is in the header).

From the answers, I find that it's worth it indeed, I'm really glad that the compilers do this :).

Are you suggesting that you see more things in the code I posted that can be easily optimized?


A compiler that does inter-procedural analysis will be able to propagate constants defined in other compilation units, so if size is a constant somewhere in your program you will get the loop unrolling and branch collapsing benefits everywhere it is used. You might check that your compiler does this first.
Quote:Original post by outRider
A compiler that does inter-procedural analysis will be able to propagate constants defined in other compilation units, so if size is a constant somewhere in your program you will get the loop unrolling and branch collapsing benefits everywhere it is used. You might check that your compiler does this first.


Unfortunately, C++ compilers generally suck at this. (And of course, C++ itself doesn't exactly make it easy)

Lode:
For further optimization, I'd try something like this in the loop body:
 // Probably no biggie, but compilers tend to like temporaries better than arrays, because there's no risk of aliasing    uint32 a2 = a;    uint32 b2 = b;    uint32 out2 = a2 + b2 + carry;    // Conditional move is a lot easier for the compiler (and CPU) to reorder and optimize than a branch    carry = ((!carry && out2 < b2) || (carry && out2 <= b2)) ? 1U : 0U;     out = outtmp;


Might not make a difference in your case, but your version relies on the compiler to 1) be able to eliminate the branches (when I tried something similar a year or two ago, GCC had problems with that, at least), and 2) perform enough aliasing analysis to optimize away all the array accesses.
Quote:Original post by Spoonbender
Quote:Original post by outRider
A compiler that does inter-procedural analysis will be able to propagate constants defined in other compilation units, so if size is a constant somewhere in your program you will get the loop unrolling and branch collapsing benefits everywhere it is used. You might check that your compiler does this first.


Unfortunately, C++ compilers generally suck at this. (And of course, C++ itself doesn't exactly make it easy)


MSVC'05 and IBM's XL both do it, I'm sure Intel and Sun do also. GCC probably doesn't do it inter-procedurally yet.
Quote:Original post by outRider
Quote:Original post by Spoonbender
Quote:Original post by outRider
A compiler that does inter-procedural analysis will be able to propagate constants defined in other compilation units, so if size is a constant somewhere in your program you will get the loop unrolling and branch collapsing benefits everywhere it is used. You might check that your compiler does this first.


Unfortunately, C++ compilers generally suck at this. (And of course, C++ itself doesn't exactly make it easy)


MSVC'05 and IBM's XL both do it, I'm sure Intel and Sun do also. GCC probably doesn't do it inter-procedurally yet.


They do some interprocedural optimizations, yes. But it's slow (may be unviable in large projects), and it's far from every kind of optimization they can perform this way. My point was simply that if size is a constant, and you want your code to be optimized, you might as well make it easy for the compiler by making it a template in the first place, rather than hoping that it will templatize your code behind your back by compiling a special optimized version for each value of size.
Quote:Original post by Spoonbender
Quote:Original post by outRider
Quote:Original post by Spoonbender
Quote:Original post by outRider
A compiler that does inter-procedural analysis will be able to propagate constants defined in other compilation units, so if size is a constant somewhere in your program you will get the loop unrolling and branch collapsing benefits everywhere it is used. You might check that your compiler does this first.


Unfortunately, C++ compilers generally suck at this. (And of course, C++ itself doesn't exactly make it easy)


MSVC'05 and IBM's XL both do it, I'm sure Intel and Sun do also. GCC probably doesn't do it inter-procedurally yet.


They do some interprocedural optimizations, yes. But it's slow (may be unviable in large projects), and it's far from every kind of optimization they can perform this way. My point was simply that if size is a constant, and you want your code to be optimized, you might as well make it easy for the compiler by making it a template in the first place, rather than hoping that it will templatize your code behind your back by compiling a special optimized version for each value of size.


I'm not talking about some optimizations, I'm talking about this particular optimization. They both do global const propagation, and I'm sure they're not the only ones.

But my apologies, I thought your point was that "C++ compilers generally suck at this. (And of course, C++ itself doesn't exactly make it easy)."
Quote:Original post by Spoonbender
you might as well make it easy for the compiler by making it a template in the first place, rather than hoping that it will templatize your code behind your back by compiling a special optimized version for each value of size.


I'd like to point out:

- Loop unrolling doesn't require a separate version of the code for each iteration-count.

- Trying to find the correct separate-version at runtime could well (in at least some cases) outweigh the advantages of the optimized code in the first place. Except that it can't even really be done, unless code exists for every possible runtime value (basically impossible for an int parameter).

This topic is closed to new replies.

Advertisement