MSVC not unrolling certain fixed-length loops

3 comments, last by Aressera 7 years, 2 months ago

I have some performance-intensive audio code that uses a fixed-length template array type to implement SSE math operations on arrays of sample data. The class looks something like this:


template < typename T, size_t size >
class Array
{
	public:
		T x[size];
};

template < typename T, size_t size >
inline Array<T,size> operator + ( const Array<T,size>& a, const Array<T,size>& b )
{
	Array<T,size> result;
	for ( size_t i = 0; i < size; i++ )
		result.x[i] = a.x[i] + b.x[i];
	return result;
}

When I compile this code with GCC at full optimization, the loop is unrolled perfectly in all cases. However, MSVC doesn't unroll the loop if the template type parameter T is not a primitive type. E.g. when T = double it unrolls the loop, but when T is a class type it doesn't. Looking at the assembly, it just emits a short loop, even though the loop length is 4 and known at compile time. This seems to give about a 10x slowdown in my code versus GCC on the same machine. The impact is bad because T is a class that wraps an SSE __m128 vector. What should be just a few unrolled vector instructions gets turned into a monstrous loop, with 4x as many instructions just controlling the loop.

I've seen that I can use recursive templates to force the unrolling of these loops, but that solution seems quite ugly and messy, and it would make the code for this array about 4x longer (it's already a few thousand lines) and much harder to read, all because MSVC can't figure out how to unroll a 4-iteration loop. I would have to do the unrolling for a full set of operators and math functions… Is there any way to force MSVC to unroll loops like these in all cases without this madness?

Which optimization settings are you using on MSVC?
e.g. /O1 will optimize for size, which will prefer non-unrolled loops.

I tried /O2 and "full optimization"; both do the same thing.

I am trying out the recursive templates, and they turned out not to be that verbose if you put all operations in one class, but it's still a pain...

I can't help much with "the class looks something like this". If possible, try to post the real offending code (reduced, of course, to the minimum necessary to reproduce the issue). It could be a compiler bug, but maybe there's a small change that will hint the compiler toward proper unrolling. Are you sure your profiling is correct? A few times I have wondered why an optimization was not applied, and it turned out that it was not really an optimization and would actually have reduced performance (though not by much). A 10x slowdown, though, looks most likely like a compiler bug. Still, there may be a simple line of code that gives the compiler the hint it needs to unroll the loop.

In the example it seems your working data type is 16 floating-point numbers (since T wraps a SIMD type), but I can't be sure because the code is incomplete.

Maybe you are missing an alignment directive? As far as I know, GCC does a lot of alignment on its own, but I don't know whether VS does that too (I don't know VS well enough).

I can't post the code, and I checked alignment on everything. I am looking at the raw assembly output. The funny thing is that it works with T = SIMDFloat4 (a wrapper of __m128), but not with T = SIMDArray<Float,4> (an array of wrappers of __m128), i.e. nested arrays cause the strangeness.

The real culprit of the slowdown is more likely a strange "vector constructor" function call (in a loop) that the compiler inserted each time I create an uninitialized array (which happened for every single math operation). I have no idea what it was doing; the assembly is cryptic. To fix this I changed the internal storage type of the array to UByte rather than T and added casts to (T*) everywhere.
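For readers following along, the storage change described above might look roughly like the sketch below. This is not the original code: the UByte typedef, the data() accessor, and the operator[] overloads are assumptions made for illustration, and alignas needs a C++11 compiler. Because the member is an array of raw bytes instead of an array of T, creating an Array does not run a T constructor for each element. The trick is only reasonable for simple element types (such as a thin __m128 wrapper) that do not depend on their constructor or destructor being run.

```cpp
#include <cstddef>

// Assumed definition of the UByte type named in the post.
typedef unsigned char UByte;

template <typename T, std::size_t size>
class Array
{
public:
    // All element access goes through a cast of the raw byte storage to T*.
    T* data() { return reinterpret_cast<T*>(storage); }
    const T* data() const { return reinterpret_cast<const T*>(storage); }

    T& operator[](std::size_t i) { return data()[i]; }
    const T& operator[](std::size_t i) const { return data()[i]; }

private:
    // Raw bytes with the alignment of T, so no T constructor is invoked
    // when an Array is created.
    alignas(T) UByte storage[size * sizeof(T)];
};
```

With this layout, a temporary such as the result array in operator + is just uninitialized storage until its elements are assigned.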

I finally got it working at GCC speeds, with nice-looking assembly, by using template recursion.
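The template recursion mentioned in this thread can be sketched as follows. This is a minimal illustration, not the code that was actually used: the Unroll helper and its add function are hypothetical names, and only operator + is shown. Each instantiation handles one element and then calls the instantiation for the next index, so the "loop" is expanded at compile time and the compiler only has to inline a chain of small functions instead of deciding whether to unroll.

```cpp
#include <cstddef>

// Same shape as the Array class in the original post.
template <typename T, std::size_t size>
class Array
{
public:
    T x[size];
};

// Hypothetical helper: processes element i, then recurses to i + 1.
// Further operations (subtract, multiply, ...) could be added as more
// static functions in this one class.
template <std::size_t i, std::size_t size>
struct Unroll
{
    template <typename T>
    static inline void add(T* result, const T* a, const T* b)
    {
        result[i] = a[i] + b[i];
        Unroll<i + 1, size>::add(result, a, b);
    }
};

// Terminating specialization: i == size, so there is nothing left to do.
template <std::size_t size>
struct Unroll<size, size>
{
    template <typename T>
    static inline void add(T*, const T*, const T*) {}
};

template <typename T, std::size_t size>
inline Array<T, size> operator+(const Array<T, size>& a, const Array<T, size>& b)
{
    Array<T, size> result;
    Unroll<0, size>::add(result.x, a.x, b.x);
    return result;
}
```

Whether every call in the chain is actually inlined still depends on the compiler and its settings, so it is worth checking the generated assembly, as was done above.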

