Loop Unrolling

Started by
12 comments, last by iMalc 15 years, 4 months ago
http://www.gamedev.net/community/forums/topic.asp?topic_id=517021 The above thread pricked my interest, and then got me thinking about loop unrolling. I am just wondering if anyone has thoughts on what kind of benefits can be expected from loop unrolling, is there really much to be gained. And now for some code, there is probably issues all over but it gets the idea across. Its also probably less efficient, but i really can't resist playing with templates :p Any thoughts people.

template <typename T, unsigned unrollCount = 0>
class ArrayAssignmentUnroller
{
	void Assign ( T* Dest, T* Source ) {}
};

template <typename T>
class ArrayAssignmentUnroller<T, 1>
{
public:
	void Assign ( T* Dest, T* Source ) 
	{ 
		Dest[0] = Source[0]; 
	}
};

template <typename T>
class ArrayAssignmentUnroller<T, 2>
{
public:
	void Assign ( T* Dest, T* Source ) 
	{ 
		Dest[0] = Source[0]; 
		Dest[1] = Source[1]; 
	}
};

template <typename T>
class ArrayAssignmentUnroller<T, 3>
{
public:
	void Assign ( T* Dest, T* Source ) 
	{ 
		Dest[0] = Source[0]; 
		Dest[1] = Source[1]; 
		Dest[2] = Source[2]; 
	}
};

template <typename T>
class ArrayAssignmentUnroller<T, 4>
{
public:
	void Assign ( T* Dest, T* Source ) 
	{ 
		Dest[0] = Source[0]; 
		Dest[1] = Source[1]; 
		Dest[2] = Source[2]; 
		Dest[3] = Source[3]; 
	}
};

template <typename T>
class ArrayAssignmentUnroller<T, 5>
{
public:
	void Assign ( T* Dest, T* Source ) 
	{ 
		Dest[0] = Source[0]; 
		Dest[1] = Source[1]; 
		Dest[2] = Source[2];
		Dest[3] = Source[3]; 
		Dest[4] = Source[4]; 
	}
};

template<typename T>
class UnrollerWrap
{
public:
	void AssignmentIterate(T* dest, T* source, unsigned size)
	{
		int count = 0;
		while( count < size )
		{
			switch( (size - count) % 5)
			{
			case 0:
				r5.Assign(&dest[count], &source[count]);
				count += 5;
				break;
			case 1:
				r1.Assign(&dest[count], &source[count]);
				count += 4;
				break;
			case 2:
				r2.Assign(&dest[count], &source[count]);
				count += 3;
				break;
			case 3:
				r3.Assign(&dest[count], &source[count]);
				count += 2;
				break;
			case 4:
				r4.Assign(&dest[count], &source[count]);
				count += 1;
				break;
			}
                }
	}

private:
	ArrayAssignmentUnroller<T, 1> r1;
	ArrayAssignmentUnroller<T, 2> r2;
	ArrayAssignmentUnroller<T, 3> r3;
	ArrayAssignmentUnroller<T, 4> r4;
	ArrayAssignmentUnroller<T, 5> r5;
};


//Values here are arbitary, just for show.
int _tmain(int argc, _TCHAR* argv[])
{
	int i[] = {1,2,3,1,2,5,5,8,7,6,8,9,9,7,5,4,2,21,6,3,458,4,2,2,3,5,4,8,7,5,9,78,7,5,1,2,56,321};
	int o[30];
	UnrollerWrap<int> uw;
	uw.AssignmentIterate(&o[0], &i[0], 20);
	return 0;




Edit the loop end condition is messed up, will fix now [Edited by - Guthur on December 6, 2008 7:03:16 PM]
Innovation not reiterationIf at any point I look as if I know what I'm doing don't worry it was probably an accident.
Advertisement
Why not just go all the way? No templates needed.
Quote:Original post by Guthur
I am just wondering if anyone has thoughts on what kind of benefits can be expected from loop unrolling, is there really much to be gained.

The benefits (and occasionally, penalties) depend on the code within the loop being unrolled.

The benefits may include:
* reduced number of comparisons
* possibly improvements in cpu usage
* possibly a reduction of cpu instructions
* possibly better pipelining
* possible better optimizing by other systems.

The costs may include:
* increased number of instructions per loop
* possibly reduced usage of cache
* possibly exceeding the cache size on small devices,
* possibly increasing total number cpu instructions
* possibly hindering pipelining
* possibly prohibiting automatic vectorization or pipelining by the optimizer
* It can introduce hard-to-research and subtle issues within the code.

There are many times where loop unwinding is inappropriate.

The optimizer can often unroll loops for you. This is generally a good idea.

When combined with other optimizations (usually only in final/release builds) manual and automatic loop unrolling can introduce subtle isuses.


I had a situation a few years back where loop unrolling combined with a very aggressive optimizer's automatic inlining. This was for an embedded device, where usually it is a good thing to take advantage of all available compiler and linker optimizations. In final builds it would sometimes blow the stack when the function containing the loop unrolling was executed. The unrolled loop called a function several times. That called function required a small array on the stack. When the arrays were automatically inlined they required a considerable amount stack space, occasionally causing a crash. Since this was a final-only bug that only crashed when the stack was already mostly full, it took a long time to track down and fix.
It appears to be about 2/3 slower than a normal copy loop. I guess things like doing % a lot are dragging it down + doesn't it interfere with the optimizations the compiler could do itself?
When the array size is known at compile time, you can use the following template code to perform the unrolled assignment:
template <class T, int n, int i> struct unrolled {	enum {CTR = i+1};	inline static void assign(T *a, const T *b) { a = b; unrolled<T, n, CTR>::assign(a, b); }};template <class T, int n> struct unrolled<T, n, n> {	inline static void assign(T *a, const T *b) {}};// ...int dest[5], source[5];// ...unrolled::assign<int, sizeof(dest)/sizeof(dest[0]), 0>(dest, source);

However, I've found previously that in VS2005 and later my attempts to use this were actually slower that just writing a loop and letting the compiler unroll it as it sees fit, even after adding things like:
	#pragma inline_depth( 255 )	#pragma inline_recursion( on )	#pragma auto_inline( on )	#define inline __forceinline

Your results may differ, but you must profile before deciding to use this technique.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
This is something your compiler will do very easily, especially for your example. Basically, to enforce what frob said...

I wrote the fallowing code for tutoring this kid learning C++. But, I seemed to learn something too! Note that this only compiles with g++ because of the inline assembly.
int main(){    int x = 0;    char array1[8] = "1234567";    char array2[8] = "0000000";    // This line makes sure the compiler doesn't initialize the arrays    // to be equal. (Down with compiler optimization!)    cout << array1 << ' ' << array2 << endl;    // TRY 1    asm("# COPY START");    int i = 8;    while( i-- )        array2 = array1;    asm("# COPY END");    // TRY 2    asm(        // array2[0...3] = array1[0...3]        "  \t movl (%0), %%eax \n"        "\t\t movl %%eax, (%1) \n"        "\n # Increment the pointers by one long word. \n"        "\t\t addl $4, %0      \n"        "\t\t addl $4, %1      \n"        // array2[4...7] = array1[4...7]        "\t\t movl (%0), %%eax \n"        "\t\t movl %%eax, (%1) \n"        :        : "rS"(array1), "rD"(array2)        : "eax"    );        // TRY 3    asm("# COPY START");    memcpy( array2, array1, 8 );    asm("# COPY END");    cout << array1 << ' ' << array2 << endl;        cout << endl;}


By examining the assembly each TRY produces, I can see how efficient each technique is. And by writing asm("# comment");, I can tell where the assembly code is when I compile with -S.

TRY 1, the while loop, produced the fallowing asm:
/APP # 18 "testing.cpp" 1	# COPY START # 0 "" 2/NO_APP	movb	-21(%ebp), %al	movb	%al, -29(%ebp)	movb	-22(%ebp), %al	movb	%al, -30(%ebp)	movb	-23(%ebp), %al	movb	%al, -31(%ebp)	movb	-24(%ebp), %al	movb	%al, -32(%ebp)	movb	-25(%ebp), %al	movb	%al, -33(%ebp)	movb	-26(%ebp), %al	movb	%al, -34(%ebp)	movb	-27(%ebp), %al	movb	%al, -35(%ebp)	movb	-28(%ebp), %al	movb	%al, -36(%ebp)/APP # 22 "testing.cpp" 1	# COPY END # 0 "" 2/NO_APP


Loop unrolled. It basically copies every byte from one array to the other, byte by byte. But, I could optimize this by moving two longs from one of the arrays to the next. That's what I did for TRY 2 and yeah, it works. But, I heard compilers are good at this when you use memcpy. TRY 3 looks like this:

	movl	-25(%ebp), %eax	movl	-21(%ebp), %edx	movl	%eax, -33(%ebp)	movl	%edx, -29(%ebp)


Now that's what I'm talking about! In fact, this is actually more optimized than your templated unroller might be. Actually, I believe it's more accurate than mine could POSSIBLY be, syntactically, since they know where the think is in memory without loading its exact address--it knows what offset from the "base pointer" is is, whereas I only know its address by having them load it for me.

Consider this:
template <typename T>class ArrayAssignmentUnroller<char, 4>{public:	void Assign ( char* Dest, char* Source ) 	{ 		Dest[0] = Source[0]; 		Dest[1] = Source[1]; 		Dest[2] = Source[2]; 		Dest[3] = Source[3]; 	}};


You'll have to tell me if the compiler is smart enough to do
    movl num(%ebp), %eax    movl $eax, num2(%ebp)


But, my point is this: Use memcpy. Or at least memcpy with the third parameter known. Here's what happens when it isn't known.
	movl	-24(%ebp), %ecx	movl	-72(%ebp), %edi	movl	-68(%ebp), %esi	rep movsb # This is the only instruction that takes any time at all.


But, then again, your example relies on the size being known anyway.

NOTES: My assembly works on any optimization level. My memcpy works on -O1 or greater. The other examples I showed were at -O3 level, the highest one.

[Edited by - Splinter of Chaos on December 8, 2008 1:45:49 AM]
Cool feedback guys, all very informative :)


Antheus: Forgot all about Duffs device :|

frob: I would have to actually agree with you that the compiler is probably best left to do this, unless you really really need to do this. Though I did think wrapping it up in a nice template made it very user friendly :p
Thanks for the info as well.

visitor: ya I knew the modulus was a bit of an achelles, i didn't spend much time optimizing, wanted proof of concept first :). Also i had initial intended to do 10 specializations, which should give a speed bust, plus maybe using memcpy (see Splinter of Chaos' post.

iMalc lol i'm jealous i should have thought of recursion :p

Splinter of Chaos: Quality stuff, memcpy would keep the lines of code down as well, its a good suggestion.


I doubt i will find myself in a situation where i need these sort of techniques anytime soon, mores the pity though :\.

I have only recently started using templates and think they provide interesting tools for problem solving, I might be getting a little addicted though :p.


EDIT lol there was 2 massive errors :| fallthrough was not intended and the cases to take care of the remainders were calling the wrong functions :|
Innovation not reiterationIf at any point I look as if I know what I'm doing don't worry it was probably an accident.
hhh wee update if anyone is interested and i think i cracked it as well. I think this should beat the for loop. Thanks for the input guys and i would be very thankfull if someone could give it a whirl.

I used memcpy, this actually had a benefit far better than performance it allowed the removal of specializations which was unwieldy with 10 or dare i say it more :p.

Also i removed the modulus from the loop by rounding the count at the start so i could just use 10 until the end. I was basicly performming the operation backwards to what i should have been doing ie testing so i could catch the remainder at the end, silly me better to catch at the start :p.

Oh and obviously i change the looping to use divisons of 10 also added inline, and made the Unroller static, seemed to make sense not sure performance wise :).

template <typename T, unsigned unrollCount>class ArrayAssignmentUnroller{public:	static inline void Assign ( T* Dest, T* Source ) 	{		memcpy( Dest, Source, CopySize);	}	static unsigned CopySize;};template <typename T, unsigned unrollCount> unsigned ArrayAssignmentUnroller<T, unrollCount>::CopySize = sizeof(T) * unrollCount;template<typename T>class UnrollerWrap{public:	inline void AssignmentIterate(T* dest, T* source, unsigned size)	{		unsigned count = (size) % 10;				switch( count )		{		case 0:			r10.Assign(&dest[0], &source[0]);			count = 10;			break;		case 1:			r1.Assign(&dest[0], &source[0]);			break;		case 2:			r2.Assign(&dest[0], &source[0]);			break;		case 3:			r3.Assign(&dest[0], &source[0]);			break;		case 4:			r4.Assign(&dest[0], &source[0]);			break;		case 5:			r5.Assign(&dest[0], &source[0]);			break;		case 6:			r6.Assign(&dest[0], &source[0]);			break;		case 7:			r7.Assign(&dest[0], &source[0]);			break;		case 8:			r8.Assign(&dest[0], &source[0]);			break;		case 9:			r9.Assign(&dest[0], &source[0]);			break;		}				while( count < size )		{			r10.Assign(&dest[count], &source[count]);			count += 10;		}					}private:	ArrayAssignmentUnroller<T, 1> r1;	ArrayAssignmentUnroller<T, 2> r2;	ArrayAssignmentUnroller<T, 3> r3;	ArrayAssignmentUnroller<T, 4> r4;	ArrayAssignmentUnroller<T, 5> r5;	ArrayAssignmentUnroller<T, 6> r6;	ArrayAssignmentUnroller<T, 7> r7;	ArrayAssignmentUnroller<T, 8> r8;	ArrayAssignmentUnroller<T, 9> r9;	ArrayAssignmentUnroller<T, 10> r10;};


EDIT removed a wee bug, they seem to get in everywhere :p

Edit umm it actually may not be as fast :( reading error on my part

Edit ok i'm confused it seems to be faster, 2,000,000 bool assignments For Loop 12ms, LoopUnroller 10ms.

Edit I added 10 more, unroller 5ms, good improvement, i suppose because the while condition is the bottle neck, but its possible i have made a massive mistake somewhere. Time for bed i think, but i have an idea for custom functionality which i might implement tomorrow :p

[Edited by - Guthur on December 6, 2008 9:10:26 PM]
Innovation not reiterationIf at any point I look as if I know what I'm doing don't worry it was probably an accident.
I don't think it is a good idea to write a function for an unspecified type T, and then use functions (memcpy) that will not work for all types. Copying the bit pattern is invalid for any type that has a nontrivial operator= or members with nontrivial operator=(also shows no respect to classes which have disabled assignment).

That said, boost does have a type_traits library that lets you identify T's for which memcpy is legal (though it probably doesn't work perfectly and gives false results for some classes to be on the safe side, it probably should at least be able to pick out built-in types).

And you probably should study the Duff's device that was pointed out earlier. If I'm not mistaken the idea is to copy an equal chunk each time, except selecting just once to copy a different amount to accommodate for various sizes.
Visitor: Good point, actually i wasn't aware that memcpy required a non-trivial assignment operator, learn something new everyday :) Though to be fair it is an assignment operation unroller :) if someone tries to use it with something that does not do assignment i can't really be held reponsible, crap in crap out :)

I'm going to go back to the specializations and see if it affects performance much this should allow deep copy assignment, I assume thats whats required.

I am going to look at Duffs Device again, I remember reading about it at Uni, i think mine is more readeable though believe it or not :) The loop inside the switch is so easy not to see, and it still looks odd even now :s

Might try the custom functionality as well :)

Thanks for the feedback again much appreciated
Innovation not reiterationIf at any point I look as if I know what I'm doing don't worry it was probably an accident.

This topic is closed to new replies.

Advertisement