Sign in to follow this  
Lode

templates and loop unrolling

Recommended Posts

Compare these two functions: A:
template<int size>
uint32 add(uint32* out, const uint32* a, const uint32* b) //return value is carry bit, out may be the same as a or b
{
  uint32 carry = 0U;
  for(int i = 0; i < size; i++)
  {
    out[i] = a[i] + b[i] + carry;
    if(!carry && out[i] < b[i]) carry = 1U; //if it has overflown
    else if(carry && out[i] <= b[i]) carry = 1U;
    else carry = 0U;
  }
  return carry;
}


B:
uint32 add(uint32* out, const uint32* a, const uint32* b, int size) //return value is carry bit, out may be the same as a or b
{
  uint32 carry = 0U;
  for(int i = 0; i < size; i++)
  {
    out[i] = a[i] + b[i] + carry;
    if(!carry && out[i] < b[i]) carry = 1U; //if it has overflown
    else if(carry && out[i] <= b[i]) carry = 1U;
    else carry = 0U;
  }
  return carry;
}


In the first version, will the compiler be able to compile things more efficiently (faster executable) because of the fact that size is a compile time value instead of a runtime value? And say you'd have an if statement that depends only on a template value "size", will that if statement be done at compile time (= only the "if" or "else" part of it are compiled depending on the value), or still at runtime (where it in fact always does the same...)?

Share this post


Link to post
Share on other sites
For the templated version the compiler will generate ASM code for each different value of size. For small values I think he will be able to fully unroll the for loop so it could possibly get faster. On the other hand you now have larger code which could theoretically be bad for the instruction cache but PCs, which nowadays have large caches, this probably is irrelevant.
A smart compiler could produce identical code to the templated version if it detects that the function is beeing called with a compile time constant for the parameter size. I'm not sure if there is any compiler which will do this.

To be really sure about this you would have to profile this and/or look at the generated ASM. I suggest you use the RDTSC instruction for this.

If you have an "if" statement which depends on a compile time constant the compiler will only compile the branch that will actually get executed and eliminate the "if" statement completly. This is also true for a switch statement which will get eliminated completely if it only depends on a compile time constant.

[Edited by - Trenki on November 25, 2007 9:05:17 AM]

Share this post


Link to post
Share on other sites
Probably, yes. It depends a bit on how many different sizes it's going to be called with (as said above, the code will be compiled once for each size, so it *may* lead to larger code). And loop unrolling isn't always going to be faster, so there may not be a huge performance gain from doing it. However, if size is a compile-time constant, then I'd prefer version A. Visual Studio at least is pretty good at avoiding code bloat with templates, and it does open up some nice possibilities.

Are you looking to optimize the code in general, or just asking about this specific question?

Share this post


Link to post
Share on other sites
Quote:
Original post by SpoonbenderAre you looking to optimize the code in general, or just asking about this specific question?


Sort of general, there are more functions like the above. I was just wondering if it was worth it to make them templates like that (with the disadvantage that the implementation is in the header).

From the answers, I find that it's worth it indeed, I'm really glad that the compilers do this :).

Are you suggesting that you see more things in the code I posted that can be easily optimized?

Share this post


Link to post
Share on other sites
Quote:
Original post by Lode
Quote:
Original post by SpoonbenderAre you looking to optimize the code in general, or just asking about this specific question?


Sort of general, there are more functions like the above. I was just wondering if it was worth it to make them templates like that (with the disadvantage that the implementation is in the header).

From the answers, I find that it's worth it indeed, I'm really glad that the compilers do this :).

Are you suggesting that you see more things in the code I posted that can be easily optimized?


A compiler that does inter-procedural analysis will be able to propagate constants defined in other compilation units, so if size is a constant somewhere in your program you will get the loop unrolling and branch collapsing benefits everywhere it is used. You might check that your compiler does this first.

Share this post


Link to post
Share on other sites
Quote:
Original post by outRider
A compiler that does inter-procedural analysis will be able to propagate constants defined in other compilation units, so if size is a constant somewhere in your program you will get the loop unrolling and branch collapsing benefits everywhere it is used. You might check that your compiler does this first.


Unfortunately, C++ compilers generally suck at this. (And of course, C++ itself doesn't exactly make it easy)

Lode:
For further optimization, I'd try something like this in the loop body:

// Probably no biggie, but compilers tend to like temporaries better than arrays, because there's no risk of aliasing
uint32 a2 = a[i];
uint32 b2 = b[i];
uint32 out2 = a2 + b2 + carry;

// Conditional move is a lot easier for the compiler (and CPU) to reorder and optimize than a branch
carry = ((!carry && out2 < b2) || (carry && out2 <= b2)) ? 1U : 0U;

out[i] = outtmp;




Might not make a difference in your case, but your version relies on the compiler to 1) be able to eliminate the branches (when I tried something similar a year or two ago, GCC had problems with that, at least), and 2) perform enough aliasing analysis to optimize away all the array accesses.

Share this post


Link to post
Share on other sites
Quote:
Original post by Spoonbender
Quote:
Original post by outRider
A compiler that does inter-procedural analysis will be able to propagate constants defined in other compilation units, so if size is a constant somewhere in your program you will get the loop unrolling and branch collapsing benefits everywhere it is used. You might check that your compiler does this first.


Unfortunately, C++ compilers generally suck at this. (And of course, C++ itself doesn't exactly make it easy)


MSVC'05 and IBM's XL both do it, I'm sure Intel and Sun do also. GCC probably doesn't do it inter-procedurally yet.

Share this post


Link to post
Share on other sites
Quote:
Original post by outRider
Quote:
Original post by Spoonbender
Quote:
Original post by outRider
A compiler that does inter-procedural analysis will be able to propagate constants defined in other compilation units, so if size is a constant somewhere in your program you will get the loop unrolling and branch collapsing benefits everywhere it is used. You might check that your compiler does this first.


Unfortunately, C++ compilers generally suck at this. (And of course, C++ itself doesn't exactly make it easy)


MSVC'05 and IBM's XL both do it, I'm sure Intel and Sun do also. GCC probably doesn't do it inter-procedurally yet.


They do some interprocedural optimizations, yes. But it's slow (may be unviable in large projects), and it's far from every kind of optimization they can perform this way. My point was simply that if size is a constant, and you want your code to be optimized, you might as well make it easy for the compiler by making it a template in the first place, rather than hoping that it will templatize your code behind your back by compiling a special optimized version for each value of size.

Share this post


Link to post
Share on other sites
Quote:
Original post by Spoonbender
Quote:
Original post by outRider
Quote:
Original post by Spoonbender
Quote:
Original post by outRider
A compiler that does inter-procedural analysis will be able to propagate constants defined in other compilation units, so if size is a constant somewhere in your program you will get the loop unrolling and branch collapsing benefits everywhere it is used. You might check that your compiler does this first.


Unfortunately, C++ compilers generally suck at this. (And of course, C++ itself doesn't exactly make it easy)


MSVC'05 and IBM's XL both do it, I'm sure Intel and Sun do also. GCC probably doesn't do it inter-procedurally yet.


They do some interprocedural optimizations, yes. But it's slow (may be unviable in large projects), and it's far from every kind of optimization they can perform this way. My point was simply that if size is a constant, and you want your code to be optimized, you might as well make it easy for the compiler by making it a template in the first place, rather than hoping that it will templatize your code behind your back by compiling a special optimized version for each value of size.


I'm not talking about some optimizations, I'm talking about this particular optimization. They both do global const propagation, and I'm sure they're not the only ones.

But my apologies, I thought your point was that "C++ compilers generally suck at this. (And of course, C++ itself doesn't exactly make it easy)."

Share this post


Link to post
Share on other sites
Quote:
Original post by Spoonbender
you might as well make it easy for the compiler by making it a template in the first place, rather than hoping that it will templatize your code behind your back by compiling a special optimized version for each value of size.


I'd like to point out:

- Loop unrolling doesn't require a separate version of the code for each iteration-count.

- Trying to find the correct separate-version at runtime could well (in at least some cases) outweigh the advantages of the optimized code in the first place. Except that it can't even really be done, unless code exists for every possible runtime value (basically impossible for an int parameter).

Share this post


Link to post
Share on other sites
Quote:
Original post by Zahlman
Quote:
Original post by Spoonbender
you might as well make it easy for the compiler by making it a template in the first place, rather than hoping that it will templatize your code behind your back by compiling a special optimized version for each value of size.


I'd like to point out:

- Loop unrolling doesn't require a separate version of the code for each iteration-count.

- Trying to find the correct separate-version at runtime could well (in at least some cases) outweigh the advantages of the optimized code in the first place. Except that it can't even really be done, unless code exists for every possible runtime value (basically impossible for an int parameter).


There is no runtime cost for GCP. If the compiler determines that size is a single constant, it will transform

uint32 add(uint32* out, const uint32* a, const uint32* b, int size)

into

uint32 add(uint32* out, const uint32* a, const uint32* b)

and simply use the value of size literally (or refer to it's address if it was complex). If you call add() with multiple size values, even if they are constants or literals, you will get a parameter pass as expected. As such, this isn't a replacement for templates (nor did I ever suggest it was), it's just a nice way to maintain generality and get the benefits of constants wrt to loop unrolling and branch collapsing without having to roll a template for a single instance.

As an aside, a compiler could generate custom variants based on values of size, but there is still no search or runtime cost involved; if you see add(..,..,..,4) you call add_4(), if you see add(..,..,..,SIZE8) you call add_8(), if you see add(..,..,..,var) you call the general version. I've personally never seen a compiler do this, but I've never looked for it either. I wouldn't be surprised if some compilers did do it however.

Share this post


Link to post
Share on other sites
Quote:
Original post by Spoonbender
Quote:
Original post by outRider
Quote:
Original post by Spoonbender
Quote:
Original post by outRider
A compiler that does inter-procedural analysis will be able to propagate constants defined in other compilation units, so if size is a constant somewhere in your program you will get the loop unrolling and branch collapsing benefits everywhere it is used. You might check that your compiler does this first.


Unfortunately, C++ compilers generally suck at this. (And of course, C++ itself doesn't exactly make it easy)


MSVC'05 and IBM's XL both do it, I'm sure Intel and Sun do also. GCC probably doesn't do it inter-procedurally yet.


They do some interprocedural optimizations, yes. But it's slow (may be unviable in large projects), and it's far from every kind of optimization they can perform this way. My point was simply that if size is a constant, and you want your code to be optimized, you might as well make it easy for the compiler by making it a template in the first place, rather than hoping that it will templatize your code behind your back by compiling a special optimized version for each value of size.


It's not that slow to do interprocedural const prop. VC2005 does it. Howevever, it may not choose to propagate constants when code size will increase significantly. Cache DOES make a difference, even these days. If the function is not externally exported, and the constant passed is the same for all calls, the constant can be substituted, potentially leading to loop unrolling opportunities or branch removal. But, in a lot of cases, you'd need to actually duplicate the function in order to be able to do substitution, resulting in increased code size. We did a bunch of experiments with some big big big projects with duplication + substitution and didn't find any appreciable win. Most likely, if the function is a good candidate for duplication it will be a good candidate for inlining, which would have the same effect of branch removal, with the difference that you wouldn't have the function prologue or EH unwind info on x86. You might try to mark the function as __forceinline and see what kind of impact that has.

Share this post


Link to post
Share on other sites
@Author,


Loop unrolling is generally better for performance if used sparingly, the rule of thumb is if it's under 64 iterations and each block is <= 16 instructions, then unroll it.

Share this post


Link to post
Share on other sites
Quote:
Original post by outRider
I'm not talking about some optimizations, I'm talking about this particular optimization. They both do global const propagation, and I'm sure they're not the only ones.

Sure, but in this case, size isn't a global const. It's a function parameter. I haven't checked, but I wouldn't count on compilers eliminating that.

Quote:
- Loop unrolling doesn't require a separate version of the code for each iteration-count.

- Trying to find the correct separate-version at runtime could well (in at least some cases) outweigh the advantages of the optimized code in the first place. Except that it can't even really be done, unless code exists for every possible runtime value (basically impossible for an int parameter).

Exactly.
If the loop is only called with one or two (known) values for size, then the compiler could (if you use the templated version), optimize for that explicitly and choose the right version at compile-time, which eliminates the runtime checking. (And perhaps unroll the loop a bit more efficiently, since it doesn't have to take into account that size might not be a multiple of the unroll factor)

Of course, if the function is called with a variable size, the template version is impossible like you say. I was assuming that it'd be pretty much constant, and known at compile-time.

Share this post


Link to post
Share on other sites
Quote:
Original post by Spoonbender
Quote:
Original post by outRider
I'm not talking about some optimizations, I'm talking about this particular optimization. They both do global const propagation, and I'm sure they're not the only ones.

Sure, but in this case, size isn't a global const. It's a function parameter. I haven't checked, but I wouldn't count on compilers eliminating that.

Quote:
- Loop unrolling doesn't require a separate version of the code for each iteration-count.

- Trying to find the correct separate-version at runtime could well (in at least some cases) outweigh the advantages of the optimized code in the first place. Except that it can't even really be done, unless code exists for every possible runtime value (basically impossible for an int parameter).

Exactly.
If the loop is only called with one or two (known) values for size, then the compiler could (if you use the templated version), optimize for that explicitly and choose the right version at compile-time, which eliminates the runtime checking. (And perhaps unroll the loop a bit more efficiently, since it doesn't have to take into account that size might not be a multiple of the unroll factor)

Of course, if the function is called with a variable size, the template version is impossible like you say. I was assuming that it'd be pretty much constant, and known at compile-time.


Global const prop doesn't mean global variable, it means globally eliminating the parameter and substituting the constant value passed into the function.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this