Unity Loop Unrolling

Recommended Posts

http://www.gamedev.net/community/forums/topic.asp?topic_id=517021

The above thread piqued my interest and got me thinking about loop unrolling. I'm wondering if anyone has thoughts on what kind of benefits can be expected from it: is there really much to be gained?

And now for some code. There are probably issues all over, but it gets the idea across. It's also probably less efficient, but I really can't resist playing with templates :p Any thoughts, people?
template <typename T, unsigned unrollCount = 0>
class ArrayAssignmentUnroller
{
public:
	void Assign ( T* Dest, T* Source ) {}
};

template <typename T>
class ArrayAssignmentUnroller<T, 1>
{
public:
	void Assign ( T* Dest, T* Source ) 
	{ 
		Dest[0] = Source[0]; 
	}
};

template <typename T>
class ArrayAssignmentUnroller<T, 2>
{
public:
	void Assign ( T* Dest, T* Source ) 
	{ 
		Dest[0] = Source[0]; 
		Dest[1] = Source[1]; 
	}
};

template <typename T>
class ArrayAssignmentUnroller<T, 3>
{
public:
	void Assign ( T* Dest, T* Source ) 
	{ 
		Dest[0] = Source[0]; 
		Dest[1] = Source[1]; 
		Dest[2] = Source[2]; 
	}
};

template <typename T>
class ArrayAssignmentUnroller<T, 4>
{
public:
	void Assign ( T* Dest, T* Source ) 
	{ 
		Dest[0] = Source[0]; 
		Dest[1] = Source[1]; 
		Dest[2] = Source[2]; 
		Dest[3] = Source[3]; 
	}
};

template <typename T>
class ArrayAssignmentUnroller<T, 5>
{
public:
	void Assign ( T* Dest, T* Source ) 
	{ 
		Dest[0] = Source[0]; 
		Dest[1] = Source[1]; 
		Dest[2] = Source[2];
		Dest[3] = Source[3]; 
		Dest[4] = Source[4]; 
	}
};

template<typename T>
class UnrollerWrap
{
public:
	void AssignmentIterate(T* dest, T* source, unsigned size)
	{
		unsigned count = 0;
		while( count < size )
		{
			switch( (size - count) % 5 )
			{
			case 0:
				r5.Assign(&dest[count], &source[count]);
				count += 5;
				break;
			case 1:
				r1.Assign(&dest[count], &source[count]);
				count += 1;
				break;
			case 2:
				r2.Assign(&dest[count], &source[count]);
				count += 2;
				break;
			case 3:
				r3.Assign(&dest[count], &source[count]);
				count += 3;
				break;
			case 4:
				r4.Assign(&dest[count], &source[count]);
				count += 4;
				break;
			}
		}
	}

private:
	ArrayAssignmentUnroller<T, 1> r1;
	ArrayAssignmentUnroller<T, 2> r2;
	ArrayAssignmentUnroller<T, 3> r3;
	ArrayAssignmentUnroller<T, 4> r4;
	ArrayAssignmentUnroller<T, 5> r5;
};


//Values here are arbitrary, just for show.
int _tmain(int argc, _TCHAR* argv[])
{
	int i[] = {1,2,3,1,2,5,5,8,7,6,8,9,9,7,5,4,2,21,6,3,458,4,2,2,3,5,4,8,7,5,9,78,7,5,1,2,56,321};
	int o[30];
	UnrollerWrap<int> uw;
	uw.AssignmentIterate(&o[0], &i[0], 20);
	return 0;
}



Edit: the loop end condition is messed up; will fix now. [Edited by - Guthur on December 6, 2008 7:03:16 PM]

Quote:
Original post by Guthur
I am just wondering if anyone has thoughts on what kind of benefits can be expected from loop unrolling, is there really much to be gained.

The benefits (and occasionally, penalties) depend on the code within the loop being unrolled.

The benefits may include:
* a reduced number of comparisons
* possible improvements in CPU usage
* possibly a reduction in the number of CPU instructions
* possibly better pipelining
* possibly better optimization by other systems

The costs may include:
* an increased number of instructions per loop
* possibly reduced cache usage
* possibly exceeding the cache size on small devices
* possibly an increased total number of CPU instructions
* possibly hindered pipelining
* possibly prohibiting automatic vectorization or pipelining by the optimizer
* subtle, hard-to-research issues introduced into the code

There are many times when loop unwinding is inappropriate.

The optimizer can often unroll loops for you, and letting it do so is generally a good idea.

When combined with other optimizations (usually only in final/release builds), manual and automatic loop unrolling can introduce subtle issues.


I had a situation a few years back where loop unrolling combined badly with a very aggressive optimizer's automatic inlining. This was for an embedded device, where it is usually a good thing to take advantage of all available compiler and linker optimizations. In final builds it would sometimes blow the stack when the function containing the unrolled loop was executed. The unrolled loop called a function several times, and that called function required a small array on the stack. When the calls were automatically inlined, their arrays together required a considerable amount of stack space, occasionally causing a crash. Since this was a final-build-only bug that only crashed when the stack was already mostly full, it took a long time to track down and fix.
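To make the "let the optimizer do it" point concrete, here is a minimal sketch: the obvious copy loop, plus an optional compiler-specific hint. The `#pragma GCC unroll` line is a GCC/Clang extension and purely illustrative; at -O2/-O3 most compilers will unroll a loop like this unaided.

```cpp
#include <cstddef>

// The plain loop the optimizer is free to unroll and vectorize.
void copy_plain(int* dest, const int* source, std::size_t n)
{
    // GCC 8+/Clang hint requesting a fixed unroll factor; compilers
    // that don't recognize it simply warn and ignore it.
    #pragma GCC unroll 4
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = source[i];
}
```

Inspecting the generated assembly (e.g. compiling with -S) is the only reliable way to see whether the unrolling actually happened.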

It appears to be about 2/3 slower than a normal copy loop. I guess things like computing % so often are dragging it down. Doesn't it also interfere with the optimizations the compiler could do itself?

When the array size is known at compile time, you can use the following template code to perform the unrolled assignment:
template <class T, int n, int i> struct unrolled {
	enum {CTR = i+1};
	inline static void assign(T *a, const T *b) { a[i] = b[i]; unrolled<T, n, CTR>::assign(a, b); }
};
template <class T, int n> struct unrolled<T, n, n> {
	inline static void assign(T *a, const T *b) {}
};
// ...
int dest[5], source[5];
// ...
unrolled<int, sizeof(dest)/sizeof(dest[0]), 0>::assign(dest, source);




However, I've previously found that in VS2005 and later my attempts to use this were actually slower than just writing a loop and letting the compiler unroll it as it sees fit, even after adding things like:
	#pragma inline_depth( 255 )
#pragma inline_recursion( on )
#pragma auto_inline( on )
#define inline __forceinline

Your results may differ, but you must profile before deciding to use this technique.
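For completeness, here is a self-contained sketch of the recursive unroller above with the call spelled out in full; the `copy_unrolled` wrapper is my own addition so callers only state T and n once.

```cpp
#include <cstddef>

// unrolled<T, n, i> emits a[i] = b[i] and then instantiates
// unrolled<T, n, i+1>; the <T, n, n> specialization terminates it.
template <class T, int n, int i>
struct unrolled {
    static void assign(T* a, const T* b)
    {
        a[i] = b[i];
        unrolled<T, n, i + 1>::assign(a, b);
    }
};

template <class T, int n>
struct unrolled<T, n, n> {
    static void assign(T*, const T*) {} // base case: nothing left to copy
};

// Convenience wrapper (not in the original post).
template <class T, int n>
void copy_unrolled(T* dest, const T* source)
{
    unrolled<T, n, 0>::assign(dest, source);
}
```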

This is something your compiler will do very easily, especially for your example. Basically, to reinforce what frob said...

I wrote the following code while tutoring a kid learning C++, but I seemed to learn something too! Note that this only compiles with g++ because of the inline assembly.
#include <iostream>
#include <cstring>
using namespace std;

int main()
{
	char array1[8] = "1234567";
	char array2[8] = "0000000";

	// This line makes sure the compiler doesn't initialize the arrays
	// to be equal. (Down with compiler optimization!)
	cout << array1 << ' ' << array2 << endl;

	// TRY 1
	asm("# COPY START");
	int i = 8;
	while( i-- )
		array2[i] = array1[i];
	asm("# COPY END");

	// TRY 2
	// The asm modifies both pointers, so they are read-modify-write
	// ("+") operands rather than plain inputs.
	const char *src = array1;
	char *dst = array2;
	asm(
	// array2[0...3] = array1[0...3]
	"\t movl (%0), %%eax \n"
	"\t\t movl %%eax, (%1) \n"

	"\n # Increment the pointers by one long word. \n"
	"\t\t addl $4, %0 \n"
	"\t\t addl $4, %1 \n"

	// array2[4...7] = array1[4...7]
	"\t\t movl (%0), %%eax \n"
	"\t\t movl %%eax, (%1) \n"
	: "+S"(src), "+D"(dst)
	:
	: "eax", "memory"
	);

	// TRY 3
	asm("# COPY START");
	memcpy( array2, array1, 8 );
	asm("# COPY END");

	cout << array1 << ' ' << array2 << endl;

	cout << endl;
}








By examining the assembly each TRY produces, I can see how efficient each technique is. And by writing asm("# comment");, I can tell where the assembly code is when I compile with -S.

TRY 1, the while loop, produced the following asm:
/APP
# 18 "testing.cpp" 1
# COPY START
# 0 "" 2
/NO_APP
movb -21(%ebp), %al
movb %al, -29(%ebp)
movb -22(%ebp), %al
movb %al, -30(%ebp)
movb -23(%ebp), %al
movb %al, -31(%ebp)
movb -24(%ebp), %al
movb %al, -32(%ebp)
movb -25(%ebp), %al
movb %al, -33(%ebp)
movb -26(%ebp), %al
movb %al, -34(%ebp)
movb -27(%ebp), %al
movb %al, -35(%ebp)
movb -28(%ebp), %al
movb %al, -36(%ebp)
/APP
# 22 "testing.cpp" 1
# COPY END
# 0 "" 2
/NO_APP








Loop unrolled: it copies every byte from one array to the other, byte by byte. I could optimize this by moving two longs from one array to the other; that's what I did for TRY 2, and yeah, it works. But I've heard compilers are good at this when you use memcpy. TRY 3 looks like this:

	movl	-25(%ebp), %eax
movl -21(%ebp), %edx
movl %eax, -33(%ebp)
movl %edx, -29(%ebp)








Now that's what I'm talking about! In fact, this is more optimized than your templated unroller might manage. Actually, I believe it's better than mine could POSSIBLY be, since the compiler knows where the thing is in memory without loading its exact address: it knows the offset from the base pointer, whereas I only know the address by having the compiler load it for me.

Consider this:
template <>
class ArrayAssignmentUnroller<char, 4>
{
public:
	void Assign ( char* Dest, char* Source )
	{
		Dest[0] = Source[0];
		Dest[1] = Source[1];
		Dest[2] = Source[2];
		Dest[3] = Source[3];
	}
};








You'll have to tell me if the compiler is smart enough to do
    movl num(%ebp), %eax
    movl %eax, num2(%ebp)


But my point is this: use memcpy, or at least memcpy with the third parameter known at compile time. Here's what happens when it isn't known:
	movl	-24(%ebp), %ecx
movl -72(%ebp), %edi
movl -68(%ebp), %esi
rep movsb # This is the only instruction that takes any time at all.


But, then again, your example relies on the size being known anyway.

NOTES: My assembly works at any optimization level. My memcpy works at -O1 or greater. The other examples I showed were compiled at -O3, the highest level.

[Edited by - Splinter of Chaos on December 8, 2008 1:45:49 AM]

Cool feedback guys, all very informative :)


Antheus: Forgot all about Duff's device :|

frob: I'd have to agree that the compiler is probably best left to do this unless you really, really need it. Though I did think wrapping it up in a nice template made it very user friendly :p Thanks for the info as well.

visitor: Yeah, I knew the modulus was a bit of an Achilles heel; I didn't spend much time optimizing, I wanted proof of concept first :). Also, I had initially intended to do 10 specializations, which should give a speed boost, plus maybe using memcpy (see Splinter of Chaos' post).

iMalc: lol, I'm jealous; I should have thought of recursion :p

Splinter of Chaos: Quality stuff. memcpy would keep the lines of code down as well; it's a good suggestion.


I doubt I will find myself in a situation where I need these sorts of techniques anytime soon, more's the pity :\.

I have only recently started using templates and think they provide interesting tools for problem solving. I might be getting a little addicted though :p.


EDIT: lol, there were 2 massive errors :| The fallthrough was not intended, and the cases taking care of the remainders were calling the wrong functions :|

Wee update, if anyone is interested; I think I cracked it as well. I think this should beat the for loop. Thanks for the input guys, and I would be very thankful if someone could give it a whirl.

I used memcpy. This actually had a benefit far better than performance: it allowed the removal of the specializations, which were unwieldy at 10 or, dare I say it, more :p.

Also, I removed the modulus from the loop by handling the remainder at the start, so I could just use steps of 10 until the end. I was basically performing the operation backwards to what I should have been doing, i.e. testing so I could catch the remainder at the end; silly me, better to catch it at the start :p.

Oh, and obviously I changed the looping to use divisions of 10. I also added inline and made the Unroller members static; it seemed to make sense, though I'm not sure performance-wise :).


template <typename T, unsigned unrollCount>
class ArrayAssignmentUnroller
{
public:
	static inline void Assign ( T* Dest, T* Source )
	{
		memcpy( Dest, Source, CopySize );	// requires <cstring>
	}
	static unsigned CopySize;
};

template <typename T, unsigned unrollCount> unsigned ArrayAssignmentUnroller<T, unrollCount>::CopySize = sizeof(T) * unrollCount;

template<typename T>
class UnrollerWrap
{
public:
	inline void AssignmentIterate(T* dest, T* source, unsigned size)
	{
		if( size == 0 )	// guard: a zero size would otherwise hit case 0 and copy 10
			return;

		unsigned count = size % 10;

		switch( count )
		{
		case 0:
			r10.Assign(&dest[0], &source[0]);
			count = 10;
			break;
		case 1:
			r1.Assign(&dest[0], &source[0]);
			break;
		case 2:
			r2.Assign(&dest[0], &source[0]);
			break;
		case 3:
			r3.Assign(&dest[0], &source[0]);
			break;
		case 4:
			r4.Assign(&dest[0], &source[0]);
			break;
		case 5:
			r5.Assign(&dest[0], &source[0]);
			break;
		case 6:
			r6.Assign(&dest[0], &source[0]);
			break;
		case 7:
			r7.Assign(&dest[0], &source[0]);
			break;
		case 8:
			r8.Assign(&dest[0], &source[0]);
			break;
		case 9:
			r9.Assign(&dest[0], &source[0]);
			break;
		}

		while( count < size )
		{
			r10.Assign(&dest[count], &source[count]);
			count += 10;
		}
	}

private:
	ArrayAssignmentUnroller<T, 1> r1;
	ArrayAssignmentUnroller<T, 2> r2;
	ArrayAssignmentUnroller<T, 3> r3;
	ArrayAssignmentUnroller<T, 4> r4;
	ArrayAssignmentUnroller<T, 5> r5;
	ArrayAssignmentUnroller<T, 6> r6;
	ArrayAssignmentUnroller<T, 7> r7;
	ArrayAssignmentUnroller<T, 8> r8;
	ArrayAssignmentUnroller<T, 9> r9;
	ArrayAssignmentUnroller<T, 10> r10;
};











EDIT: removed a wee bug; they seem to get in everywhere :p

Edit: umm, it actually may not be as fast :( Reading error on my part.

Edit: OK, I'm confused; it seems to be faster. 2,000,000 bool assignments: for loop 12ms, LoopUnroller 10ms.

Edit: I added 10 more specializations: unroller 5ms, a good improvement. I suppose it's because the while condition is the bottleneck, but it's possible I have made a massive mistake somewhere. Time for bed I think, but I have an idea for custom functionality which I might implement tomorrow :p

[Edited by - Guthur on December 6, 2008 9:10:26 PM]

I don't think it is a good idea to write a function for an unspecified type T and then use functions (memcpy) that will not work for all types. Copying the bit pattern is invalid for any type that has a nontrivial operator= or members with a nontrivial operator= (it also shows no respect for classes that have disabled assignment).

That said, boost does have a type_traits library that lets you identify the T's for which memcpy is legal (though it probably doesn't work perfectly and, to be on the safe side, gives false results for some classes; it should at least be able to pick out the built-in types).

And you probably should study the Duff's device that was pointed out earlier. If I'm not mistaken, the idea is to copy an equal-sized chunk each time, selecting just once how much extra to copy to accommodate various sizes.
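The trait-based selection described above can be sketched like this. Boost's type_traits was the option at the time; the version below substitutes the equivalent C++11 `std::is_trivially_copyable`, so treat it as a modern restatement rather than the 2008 approach:

```cpp
#include <cstring>
#include <cstddef>
#include <type_traits>

// memcpy path: chosen only when copying the bit pattern is the whole story.
template <typename T>
void copy_array_impl(T* dest, const T* source, std::size_t n, std::true_type)
{
    std::memcpy(dest, source, n * sizeof(T));
}

// Fallback: element-wise operator=, valid for any assignable type.
template <typename T>
void copy_array_impl(T* dest, const T* source, std::size_t n, std::false_type)
{
    for (std::size_t i = 0; i < n; ++i)
        dest[i] = source[i];
}

// Dispatch on the trait at compile time.
template <typename T>
void copy_array(T* dest, const T* source, std::size_t n)
{
    copy_array_impl(dest, source, n, std::is_trivially_copyable<T>{});
}
```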

visitor: Good point. Actually, I wasn't aware that memcpy required a trivial assignment operator; learn something new every day :) Though to be fair, it is an assignment operation unroller :) If someone tries to use it with something that doesn't do plain assignment, I can't really be held responsible; crap in, crap out :)

I'm going to go back to the specializations and see if it affects performance much. This should allow deep-copy assignment, which I assume is what's required.

I am going to look at Duff's Device again; I remember reading about it at uni. I think mine is more readable though, believe it or not :) The loop inside the switch is so easy not to see, and it still looks odd even now :s

Might try the custom functionality as well :)

Thanks for the feedback again, much appreciated.

You might take a look at this: An optimized version of std::copy with boost

And about Duff's device not being readable: it's supposed to be fast, not readable. Fast and readable often exclude each other (you should consider whether it is worth making it that fast). However, if the unreadable code is hidden behind a nice interface for the end user and limited to a small amount of code, why not?
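For anyone who hasn't seen it, the classic Duff's device has this shape (an 8-way unrolled copy; the interleaved do/switch is exactly the "fast but unreadable" construct being discussed, and the zero-count guard covers a well-known pitfall of the device):

```cpp
#include <cstddef>

// Duff's device: the switch jumps into the middle of an 8-way unrolled
// loop to absorb the remainder, then whole chunks of 8 fall through.
void duff_copy(int* to, const int* from, std::size_t count)
{
    if (count == 0)
        return;                       // without this, 0 would copy 8 elements
    std::size_t n = (count + 7) / 8;  // number of do-while passes
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```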

Indeed, which the template gives you :)

These things really are groovy. I can't believe I avoided them for so long... wasted youth :\ :p


EDIT: Did some further study and realised it was of course memcpy that was making all the gains :). But I did implement one that can use custom functionality, sort of policy-based design, I think :s. Unfortunately I could only get close to the compiler's for loop, not quite beat it: 2,000,000 ops, for loop 302ms, unroller 312ms :| Ah well, it was interesting, but I think I need to get back to my portfolio now :)

P.S. I had thought about integrating some multithreading, but realised it would not really be a fair test against the for loop then.

Here's a little code. I removed a lot of the repetition in the specializations to keep it small; it gives the idea at least.


template<typename T>
struct InitialiseDuo
{
inline void DoOp(T &Dest, T &Source)
{
srand(12);
Dest = rand();
Source = rand();
}
};

template <typename T, unsigned unrollCount, template<typename> class Op>
class Unroller
{
public:
__forceinline void Assign ( T* Dest, T* Source )
{
//Oper.DoOp(Dest, Source);
}
Op<T> Oper;
};

template <typename T, template<typename> class Op>
class Unroller<T, 1, Op>
{
public:
__forceinline void Assign ( T* Dest, T* Source )
{
Oper.DoOp(Dest[0], Source[0]);
}
Op<T> Oper;
};

.....****......

template <typename T, template<typename> class Op>
class Unroller<T, 20, Op>
{
public:
__forceinline void Assign ( T* Dest, T* Source )
{
Oper.DoOp(Dest[0], Source[0]);
Oper.DoOp(Dest[1], Source[1]);
Oper.DoOp(Dest[2], Source[2]);
Oper.DoOp(Dest[3], Source[3]);
Oper.DoOp(Dest[4], Source[4]);
Oper.DoOp(Dest[5], Source[5]);
Oper.DoOp(Dest[6], Source[6]);
Oper.DoOp(Dest[7], Source[7]);
Oper.DoOp(Dest[8], Source[8]);
Oper.DoOp(Dest[9], Source[9]);
Oper.DoOp(Dest[10], Source[10]);
Oper.DoOp(Dest[11], Source[11]);
Oper.DoOp(Dest[12], Source[12]);
Oper.DoOp(Dest[13], Source[13]);
Oper.DoOp(Dest[14], Source[14]);
Oper.DoOp(Dest[15], Source[15]);
Oper.DoOp(Dest[16], Source[16]);
Oper.DoOp(Dest[17], Source[17]);
Oper.DoOp(Dest[18], Source[18]);
Oper.DoOp(Dest[19], Source[19]);
}
Op<T> Oper;
};



//template <typename T, unsigned unrollCount, template<typename> class Op> unsigned Unroller<T, unrollCount, Op >::CopySize = sizeof(T) * unrollCount;

template<typename T, template<typename> class Operation >
class UnrollerWrap
{
public:
__forceinline void AssignmentIterate(T* dest, T* source, unsigned size)
{
unsigned count = (size) % 20;

switch( count )
{
case 1:
size -= count;
r1.Assign(&dest[size], &source[size]);
break;
case 2:
size -= count;
r2.Assign(&dest[size], &source[size]);
break;
case 3:
size -= count;
r3.Assign(&dest[size], &source[size]);
break;
case 4:
size -= count;
r4.Assign(&dest[size], &source[size]);
break;
case 5:
size -= count;
r5.Assign(&dest[size], &source[size]);
break;
case 6:
size -= count;
r6.Assign(&dest[size], &source[size]);
break;
case 7:
size -= count;
r7.Assign(&dest[size], &source[size]);
break;
case 8:
size -= count;
r8.Assign(&dest[size], &source[size]);
break;
case 9:
size -= count;
r9.Assign(&dest[size], &source[size]);
break;
case 10:
size -= count;
r10.Assign(&dest[size], &source[size]);
break;
case 11:
size -= count;
r11.Assign(&dest[size], &source[size]);
break;
case 12:
size -= count;
r12.Assign(&dest[size], &source[size]);
break;
case 13:
size -= count;
r13.Assign(&dest[size], &source[size]);
break;
case 14:
size -= count;
r14.Assign(&dest[size], &source[size]);
break;
case 15:
size -= count;
r15.Assign(&dest[size], &source[size]);
break;
case 16:
size -= count;
r16.Assign(&dest[size], &source[size]);
break;
case 17:
size -= count;
r17.Assign(&dest[size], &source[size]);
break;
case 18:
size -= count;
r18.Assign(&dest[size], &source[size]);
break;
case 19:
size -= count;
r19.Assign(&dest[size], &source[size]);
break;
}
if (size == 0) // guard: entering the Duff loop with 0 would underflow size
return;
count = size / 20;
switch (count % 10)
{
case 0: do { size -= 20; r20.Assign(&dest[size], &source[size]);
case 9: size -= 20; r20.Assign(&dest[size], &source[size]);
case 8: size -= 20; r20.Assign(&dest[size], &source[size]);
case 7: size -= 20; r20.Assign(&dest[size], &source[size]);
case 6: size -= 20; r20.Assign(&dest[size], &source[size]);
case 5: size -= 20; r20.Assign(&dest[size], &source[size]);
case 4: size -= 20; r20.Assign(&dest[size], &source[size]);
case 3: size -= 20; r20.Assign(&dest[size], &source[size]);
case 2: size -= 20; r20.Assign(&dest[size], &source[size]);
case 1: size -= 20; r20.Assign(&dest[size], &source[size]);
} while (size > 0);
}
}

private:
Unroller<T, 1, Operation > r1;
Unroller<T, 2, Operation > r2;
Unroller<T, 3, Operation > r3;
Unroller<T, 4, Operation > r4;
Unroller<T, 5, Operation > r5;
Unroller<T, 6, Operation > r6;
Unroller<T, 7, Operation > r7;
Unroller<T, 8, Operation > r8;
Unroller<T, 9, Operation > r9;
Unroller<T, 10, Operation > r10;
Unroller<T, 11, Operation > r11;
Unroller<T, 12, Operation > r12;
Unroller<T, 13, Operation > r13;
Unroller<T, 14, Operation > r14;
Unroller<T, 15, Operation > r15;
Unroller<T, 16, Operation > r16;
Unroller<T, 17, Operation > r17;
Unroller<T, 18, Operation > r18;
Unroller<T, 19, Operation > r19;
Unroller<T, 20, Operation > r20;
};






EDIT: This was the final-ish version; no better really, just less code. I put it up for completeness. Compilers are (generally) too good to spend ages trying to beat :p


template<typename T>
struct InitialiseDuo
{
__forceinline void DoOp(T &Dest, T &Source)
{
....
}
};

template <typename T, template<typename> class Op>
class Unroller
{
public:
__forceinline void Assign ( T* &Dest, T* &Source, const unsigned &iterations )
{
switch( iterations )
{
case 40 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 39 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 38 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 37 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 36 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 35 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 34 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 33 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 32 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 31 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 30 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 29 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 28 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 27 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 26 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 25 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 24 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 23 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 22 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 21 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 20 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 19 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 18 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 17 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 16 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 15 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 14 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 13 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 12 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 11 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 10 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 9 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 8 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 7 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 6 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 5 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 4 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 3 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 2 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
case 1 : Oper.DoOp(*Dest, *Source); ++Dest; ++Source;
}
}
Op<T> Oper;
};

const unsigned StepSize = 40;
template<typename T, template<typename> class Operation >
class UnrollerWrap
{
public:
__forceinline void AssignmentIterate(T* dest, T* source, unsigned size)
{
unsigned count = (size) % StepSize;
r1.Assign(dest, source, count);

while( count < size )
{
r1.Assign( dest, source, StepSize );
count += StepSize;
}
}
Unroller<T, Operation > r1;
};





[Edited by - Guthur on December 7, 2008 5:34:47 PM]

One way to employ templates might be to let them determine whether the count is cheaply available or not:


#include <iterator>
#include <vector>
#include <list>
#include <algorithm>
#include <cassert>
#include <iostream>
#include <cstdlib>


template <typename I1, typename I2, typename ItType>
I2 do_copy(I1 first, I1 last, I2 out, ItType)
{
std::cout << "Plain\n";
while(first != last)
{
*out = *first;
++out;
++first;
}
return out;
}

//if source iterators are not random access, last - first would not be available and distance would be slow
template <typename I1, typename I2>
I2 do_copy(I1 first, I1 last, I2 out, std::random_access_iterator_tag)
{
std::cout << "Optimized\n";
int count = last - first;
if (count == 0) // guard against the Duff-device zero-count pitfall
return out;
int n = (count + 7) / 8;
switch (count % 8) {
case 0: do { *out++ = *first++;
case 7: *out++ = *first++;
case 6: *out++ = *first++;
case 5: *out++ = *first++;
case 4: *out++ = *first++;
case 3: *out++ = *first++;
case 2: *out++ = *first++;
case 1: *out++ = *first++;
} while (--n > 0);
}
return out;
}

template<typename I1, typename I2>
I2 copy(I1 first, I1 last, I2 out)
{
return do_copy(first, last, out, typename std::iterator_traits<I1>::iterator_category());
}

int main()
{
const unsigned size = 1234;
std::vector<int> a(size);
int b[size];
int c[size];
std::list<int> d;
int e[size];
std::generate_n(b, size, std::rand);
::copy(b, b + size, a.begin());
::copy(a.begin(), a.end(), std::back_inserter(d));
::copy(b, b + size, c);
::copy(d.begin(), d.end(), e);
assert(std::equal(a.begin(), a.end(), b ));
assert(std::equal(a.begin(), a.end(), c ));
assert(std::equal(a.begin(), a.end(), d.begin()));
assert(std::equal(a.begin(), a.end(), e));
}

For further study on the idea of fast array copying, I highly recommend a look into the yasli_vector class from the Loki library. It has a brilliant scheme for detecting when it is safe to use memcpy, for example, and also essentially allows classes to specify if/how they can be move-constructed, iirc. [cool]
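I don't have the Loki source to hand, so the following is only a loose sketch of the general idea (pick a raw-bytes relocation when the element type's traits allow it, fall back to element-wise assignment otherwise), not yasli_vector's actual scheme; it also uses the C++11 trait rather than Loki's own machinery:

```cpp
#include <cstring>
#include <cstddef>
#include <type_traits>

// Move n elements into new storage, choosing the strategy from the
// element type's traits. Illustrative only; Loki's real scheme differs.
template <typename T>
void relocate(T* dest, T* source, std::size_t n)
{
    if (std::is_trivially_copyable<T>::value) {
        std::memcpy(dest, source, n * sizeof(T)); // raw bytes are enough
    } else {
        for (std::size_t i = 0; i < n; ++i)
            dest[i] = source[i];                  // honour operator=
    }
}
```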

