templates and loop unrolling

Started by
13 comments, last by gran sveti 16 years, 5 months ago
Quote:Original post by Zahlman
Quote:Original post by Spoonbender
you might as well make it easy for the compiler by making it a template in the first place, rather than hoping that it will templatize your code behind your back by compiling a special optimized version for each value of size.


I'd like to point out:

- Loop unrolling doesn't require a separate version of the code for each iteration-count.

- Trying to find the correct separate-version at runtime could well (in at least some cases) outweigh the advantages of the optimized code in the first place. Except that it can't even really be done, unless code exists for every possible runtime value (basically impossible for an int parameter).


There is no runtime cost for GCP. If the compiler determines that size is a single constant, it will transform

uint32 add(uint32* out, const uint32* a, const uint32* b, int size)

into

uint32 add(uint32* out, const uint32* a, const uint32* b)

and simply use the value of size literally (or refer to it's address if it was complex). If you call add() with multiple size values, even if they are constants or literals, you will get a parameter pass as expected. As such, this isn't a replacement for templates (nor did I ever suggest it was), it's just a nice way to maintain generality and get the benefits of constants wrt to loop unrolling and branch collapsing without having to roll a template for a single instance.

As an aside, a compiler could generate custom variants based on values of size, but there is still no search or runtime cost involved; if you see add(..,..,..,4) you call add_4(), if you see add(..,..,..,SIZE8) you call add_8(), if you see add(..,..,..,var) you call the general version. I've personally never seen a compiler do this, but I've never looked for it either. I wouldn't be surprised if some compilers did do it however.
Advertisement
Quote:Original post by Spoonbender
Quote:Original post by outRider
Quote:Original post by Spoonbender
Quote:Original post by outRider
A compiler that does inter-procedural analysis will be able to propagate constants defined in other compilation units, so if size is a constant somewhere in your program you will get the loop unrolling and branch collapsing benefits everywhere it is used. You might check that your compiler does this first.


Unfortunately, C++ compilers generally suck at this. (And of course, C++ itself doesn't exactly make it easy)


MSVC'05 and IBM's XL both do it, I'm sure Intel and Sun do also. GCC probably doesn't do it inter-procedurally yet.


They do some interprocedural optimizations, yes. But it's slow (may be unviable in large projects), and it's far from every kind of optimization they can perform this way. My point was simply that if size is a constant, and you want your code to be optimized, you might as well make it easy for the compiler by making it a template in the first place, rather than hoping that it will templatize your code behind your back by compiling a special optimized version for each value of size.


It's not that slow to do interprocedural const prop. VC2005 does it. Howevever, it may not choose to propagate constants when code size will increase significantly. Cache DOES make a difference, even these days. If the function is not externally exported, and the constant passed is the same for all calls, the constant can be substituted, potentially leading to loop unrolling opportunities or branch removal. But, in a lot of cases, you'd need to actually duplicate the function in order to be able to do substitution, resulting in increased code size. We did a bunch of experiments with some big big big projects with duplication + substitution and didn't find any appreciable win. Most likely, if the function is a good candidate for duplication it will be a good candidate for inlining, which would have the same effect of branch removal, with the difference that you wouldn't have the function prologue or EH unwind info on x86. You might try to mark the function as __forceinline and see what kind of impact that has.
@Author,


Loop unrolling is generally better for performance if used sparingly, the rule of thumb is if it's under 64 iterations and each block is <= 16 instructions, then unroll it.
Quote:Original post by outRider
I'm not talking about some optimizations, I'm talking about this particular optimization. They both do global const propagation, and I'm sure they're not the only ones.

Sure, but in this case, size isn't a global const. It's a function parameter. I haven't checked, but I wouldn't count on compilers eliminating that.

Quote:- Loop unrolling doesn't require a separate version of the code for each iteration-count.

- Trying to find the correct separate-version at runtime could well (in at least some cases) outweigh the advantages of the optimized code in the first place. Except that it can't even really be done, unless code exists for every possible runtime value (basically impossible for an int parameter).

Exactly.
If the loop is only called with one or two (known) values for size, then the compiler could (if you use the templated version), optimize for that explicitly and choose the right version at compile-time, which eliminates the runtime checking. (And perhaps unroll the loop a bit more efficiently, since it doesn't have to take into account that size might not be a multiple of the unroll factor)

Of course, if the function is called with a variable size, the template version is impossible like you say. I was assuming that it'd be pretty much constant, and known at compile-time.
Quote:Original post by Spoonbender
Quote:Original post by outRider
I'm not talking about some optimizations, I'm talking about this particular optimization. They both do global const propagation, and I'm sure they're not the only ones.

Sure, but in this case, size isn't a global const. It's a function parameter. I haven't checked, but I wouldn't count on compilers eliminating that.

Quote:- Loop unrolling doesn't require a separate version of the code for each iteration-count.

- Trying to find the correct separate-version at runtime could well (in at least some cases) outweigh the advantages of the optimized code in the first place. Except that it can't even really be done, unless code exists for every possible runtime value (basically impossible for an int parameter).

Exactly.
If the loop is only called with one or two (known) values for size, then the compiler could (if you use the templated version), optimize for that explicitly and choose the right version at compile-time, which eliminates the runtime checking. (And perhaps unroll the loop a bit more efficiently, since it doesn't have to take into account that size might not be a multiple of the unroll factor)

Of course, if the function is called with a variable size, the template version is impossible like you say. I was assuming that it'd be pretty much constant, and known at compile-time.


Global const prop doesn't mean global variable, it means globally eliminating the parameter and substituting the constant value passed into the function.

This topic is closed to new replies.

Advertisement