I have some performance-intensive audio code that uses a fixed-length template array type to implement SSE math operations on arrays of sample data. The class looks something like this:
    template < typename T, size_t size >
    class Array
    {
    public:
        T x[size];
    };

    template < typename T, size_t size >
    inline Array<T,size> operator + ( const Array<T,size>& a, const Array<T,size>& b )
    {
        Array<T,size> result;
        for ( size_t i = 0; i < size; i++ )
            result.x[i] = a.x[i] + b.x[i];
        return result;
    }
When I compile this code with GCC with all optimizations enabled, the loop is unrolled perfectly in all cases. MSVC, however, doesn't unroll the loop if the template type parameter T is not a primitive type: when T = double it unrolls the loop, but when T is a class type it doesn't. Looking at the assembly, it just emits a short loop, even though the loop size is 4 and known at compile time. This seems to give about a 10x slowdown in my code versus GCC on the same machine. The impact is bad because T is a class that wraps an SSE __m128 vector. What should be just a few unrolled vector instructions gets turned into a monstrous loop, with 4x as many instructions just controlling the loop.
I've seen that I can use recursive templates to force the unrolling of these loops, but that solution seems quite ugly and messy. It would make the code for this array about 4x longer (it's already a few thousand lines) and much harder to read, all because MSVC can't figure out how to unroll a 4-iteration loop. I would have to do the unrolling for a full set of operators and math functions… Is there any way to force MSVC to unroll loops like these in all cases without this madness?
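For reference, the recursive-template unrolling mentioned above could look roughly like the sketch below, shown for operator + only. The helper name `UnrolledAdd` is hypothetical and not taken from the original code. This is a minimal illustration of the technique, not a guaranteed fix: the compiler still has to inline each recursive call for the result to become straight-line code.

```cpp
#include <cstddef>

template <typename T, std::size_t size>
class Array
{
public:
    T x[size];
};

// Hypothetical helper (name is an assumption): handles element i, then
// recurses to element i + 1 through a new template instantiation, so no
// runtime loop counter exists in the source.
template <std::size_t i, std::size_t size>
struct UnrolledAdd
{
    template <typename T>
    static inline void apply(Array<T, size>& r,
                             const Array<T, size>& a,
                             const Array<T, size>& b)
    {
        r.x[i] = a.x[i] + b.x[i];
        UnrolledAdd<i + 1, size>::apply(r, a, b);
    }
};

// Terminating specialization: i == size, nothing left to add.
template <std::size_t size>
struct UnrolledAdd<size, size>
{
    template <typename T>
    static inline void apply(Array<T, size>&,
                             const Array<T, size>&,
                             const Array<T, size>&)
    {
    }
};

template <typename T, std::size_t size>
inline Array<T, size> operator+(const Array<T, size>& a, const Array<T, size>& b)
{
    Array<T, size> result;
    UnrolledAdd<0, size>::apply(result, a, b);
    return result;
}
```

Every operator and math function would need its own helper like this (or a generic one taking a functor), which is where the extra code volume described above comes from.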