Aressera

MSVC not unrolling certain fixed-length loops


I have some performance-intensive audio code that uses a fixed-length template array type to implement SSE math operations on arrays of sample data. The class looks something like this:

template < typename T, size_t size >
class Array
{
	public:
		T x[size];
};

template < typename T, size_t size >
inline Array<T,size> operator + ( const Array<T,size>& a, const Array<T,size>& b )
{
	Array<T,size> result;
	for ( size_t i = 0; i < size; i++ )
		result.x[i] = a.x[i] + b.x[i];
	return result;
}

When I compile this code with GCC with all optimizations enabled, the loop is unrolled perfectly in every case. MSVC, however, doesn't unroll the loop if the template type parameter T is not a primitive type. E.g. when T = double it unrolls the loop, but when T is a class type it doesn't. Looking at the assembly, it just emits a short loop, even though the loop length is 4 and known at compile time. This seems to give about a 10x slowdown in my code versus GCC on the same machine. The impact is bad because T is a class that wraps an SSE __m128 vector. What should be just a few unrolled vector instructions gets turned into a monstrous loop, with 4x as many instructions just controlling the loop.

I've seen that I can use recursive templates to force the unrolling of these loops, but that solution seems quite ugly and messy. It would make the code for this array about 4x longer (it's already a few thousand lines) and much harder to read, all because MSVC can't figure out how to unroll a 4-iteration loop. I would have to do the unrolling for a full set of operators and math functions… Is there any way to force MSVC to unroll loops like these in all cases without this madness?

Edited by Aressera

Which optimization settings are you using on MSVC?
e.g. /O1 will optimize for size, which will prefer non-unrolled loops.


I tried /O2 and "Full Optimization"; both do the same thing.

I am trying out the recursive templates, and they turn out not to be that verbose if you put all operations in one class, but it's still a pain...
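For readers unfamiliar with the technique, here is a minimal sketch of recursive-template unrolling with all operations gathered in one helper class. The name Unroll and its member functions are hypothetical and are not taken from the poster's code; whether a given MSVC version actually flattens the recursion depends on its inlining decisions.

```cpp
#include <cstddef>

template < typename T, std::size_t size >
class Array
{
	public:
		T x[size];
};

// Hypothetical helper: Unroll<T,i,size> processes element i, then recurses to i + 1.
// Every operation lives in this one class, so only one terminating specialization is needed.
template < typename T, std::size_t i, std::size_t size >
struct Unroll
{
	static inline void add( T* r, const T* a, const T* b )
	{
		r[i] = a[i] + b[i];
		Unroll<T, i + 1, size>::add( r, a, b );
	}

	static inline void multiply( T* r, const T* a, const T* b )
	{
		r[i] = a[i] * b[i];
		Unroll<T, i + 1, size>::multiply( r, a, b );
	}
};

// Terminating case (i == size): nothing left to do.
template < typename T, std::size_t size >
struct Unroll<T, size, size>
{
	static inline void add( T*, const T*, const T* ) {}
	static inline void multiply( T*, const T*, const T* ) {}
};

template < typename T, std::size_t size >
inline Array<T,size> operator + ( const Array<T,size>& a, const Array<T,size>& b )
{
	Array<T,size> result;
	Unroll<T, 0, size>::add( result.x, a.x, b.x );
	return result;
}

template < typename T, std::size_t size >
inline Array<T,size> operator * ( const Array<T,size>& a, const Array<T,size>& b )
{
	Array<T,size> result;
	Unroll<T, 0, size>::multiply( result.x, a.x, b.x );
	return result;
}
```

Since inline is only a hint, the recursion disappears only if the compiler chooses to inline each level. On MSVC, replacing inline with the compiler-specific __forceinline may help, but that is an assumption to verify in the generated assembly.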

Share this post


Link to post
Share on other sites

I can't help much with "the class looks something like this". If possible, post the real offending code (reduced, of course, to the minimum needed to reproduce the issue). It could be a compiler bug, but maybe there's a small change that will hint the compiler toward proper unrolling. Are you sure your profiling is correct? A few times I've wondered why an optimization wasn't applied, and it turned out that it wasn't really an optimization and would actually have reduced performance (though not by much). A 10x slowdown, though, looks more like a compiler bug. Still, there may be a simple line of code that gets the loop unrolled.

In your example the working data type seems to be 16 floating-point numbers (since T wraps a SIMD vector), but I can't be sure because the code is incomplete.

Are you maybe missing an alignment directive? As far as I know, GCC does a lot of alignment on its own, but I don't know whether VS does that too (I don't know VS well enough).
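As a rough illustration of the alignment question, the sketch below uses a hypothetical Float4 struct as a stand-in for the SSE wrapper (the real class is not shown in the thread) and checks at compile time that the 16-byte alignment carries over to the enclosing array. It assumes a C++11 compiler; older MSVC versions would need the compiler-specific __declspec(align(16)) instead of alignas.

```cpp
#include <cstddef>

// Hypothetical stand-in for a class wrapping an SSE __m128 vector.
struct alignas(16) Float4
{
	float v[4];
};

template < typename T, std::size_t size >
class Array
{
	public:
		T x[size];
};

// A shorthand so the checks below are easy to read.
typedef Array<Float4,4> Float4x4;

// The array inherits its element type's alignment, so no extra directive is needed on Array.
static_assert( alignof(Float4) == 16, "Float4 should be 16-byte aligned" );
static_assert( alignof(Float4x4) == 16, "Array should inherit the element alignment" );

// Four Float4 elements are 16 floats in total, matching the guess made above.
static_assert( sizeof(Float4x4) == 16 * sizeof(float), "unexpected padding in Array" );
```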

Edited by DemonDar


I can't post the code, and I checked alignment on everything. I am looking at the raw assembly output. The funny thing is that it works with T = SIMDFloat4 (a wrapper of __m128), but not with T = SIMDArray<Float,4> (an array of wrappers of __m128), i.e. nested arrays cause the strangeness.

The real culprit of the slowdown is more likely a strange "vector constructor" function call (in a loop) that the compiler inserted each time I created an uninitialized array (which was for every single math operation). I have no idea what it was doing; the assembly is cryptic. To fix this I changed the internal storage type of the array to UByte rather than T and added casts to (T*) everywhere.
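The byte-storage workaround described above might look roughly like the following sketch. The member names storage and data() are hypothetical, and unsigned char stands in for the poster's UByte type. Because no T object is ever formally constructed, this is only reasonable for trivially copyable element types, which the static_assert enforces; strictly conforming code would create the elements with placement new.

```cpp
#include <cstddef>
#include <type_traits>

template < typename T, std::size_t size >
class Array
{
	static_assert( std::is_trivially_copyable<T>::value,
		"raw byte storage is only sketched here for trivially copyable types" );

	public:
		// Raw bytes with T's alignment: declaring an Array runs no T constructors.
		alignas(T) unsigned char storage[size * sizeof(T)];

		// Casts to T*, as in the workaround described above.
		T* data() { return reinterpret_cast<T*>( storage ); }
		const T* data() const { return reinterpret_cast<const T*>( storage ); }
};

template < typename T, std::size_t size >
inline Array<T,size> operator + ( const Array<T,size>& a, const Array<T,size>& b )
{
	Array<T,size> result; // no per-element constructor call is generated here
	T* r = result.data();
	const T* pa = a.data();
	const T* pb = b.data();
	for ( std::size_t i = 0; i < size; i++ )
		r[i] = pa[i] + pb[i];
	return result;
}
```

The trade-off is that the compiler no longer sees an array of T, so the element type's constructors and destructor are never run automatically.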

 

I finally got it working at GCC speeds, with nice-looking generated assembly, by using template recursion.
