fastest iteration through an array


Hi, I'm trying to improve the speed of iterating through a vector/array in my code, as I seem to be doing this a lot. The usual way I do it is with a for loop, like so:

for(int i = 0; i < size; i++)
{
element[i] = 1;
}


but you can see the problem: each iteration has to compute the element's address from the index, effectively temp = (element + i). So I tried doing it this way:
for(int* i = element; i != element + size; ++i)
{
*i = 1;
}

which improved the speed quite a bit. The code above (I think) would be equivalent to:
for(std::vector<int>::iterator i = element.begin(); i != element.end(); ++i)
{
*i = 1;
}

if you were using std::vector. My question is: does the conditional term get evaluated once and stored in a temporary, or would the .end() function be called on each iteration? I ran a profiler on these three loops with the same array size; the times (in ticks) were roughly 900 for indexing, 450 for pointers, and 13000 for std::vector. Also, any reason why it takes so long with std::vector?

Share on other sites
All of the above examples are algorithmically too slow.

Find a qword setmem algorithm and use it to write the value 0x01010101. SIMD would work as well.

For anything larger than the L1 cache, memory access will be your biggest killer anyway.

Share on other sites
Quote:
 Original post by staticVoid2: also any reason why it takes so long in std::vector?
I'll bet that you're either compiling in debug mode or with runtime iterator checking on.

By the way, forcing the compiler to generate a particular sequence of instructions or loop form is usually quite hard. But anyway, the following loop form is usually the fastest in assembly language and is what the AMD (and Intel, IIRC) optimization guide recommends:
void iterate_over(const int *array, size_t len)
{
    ptrdiff_t i;
    if(!len) {
        return;
    }
    array += len;            /* point one past the end */
    i = -(ptrdiff_t)len;     /* count up towards zero */
    do {
        process(array[i]);
    } while(++i);            /* the loop ends when i reaches 0 */
}

Of course other factors, such as cache behavior and, you know, what's actually inside the loop, usually make far more difference. And if the loop overhead really is a problem (it's usually virtually free on x86, since little beyond painstakingly hand-tuned assembly code can hope to use all of the processor's execution resources), you can always unroll the loop.

The ideal way also depends on register pressure, size of the array elements (the index method is only suitable for powers-of-two between one and eight bytes on x86/x64), and so on and so forth. Plus other architectures prefer very different code. It's a mess really so I'd just stick to the most natural form and then rewrite any code you truly need to squeeze every last cycle out of in assembly language.

The one easy and universal fix I can think of is that 32-bit indices seem to generate crappy code on x64, at least in GCC, so stick to size_t and the like.

edit: The conditional might or might not be re-evaluated every iteration, depending on whether the compiler can deduce that the array's size/origin won't change. If they're simply passed as pointer and integer arguments to a function then you're probably fine, but if you're being passed a whole std::vector by reference and invoking end() every iteration then you'd probably be better off caching it. The same rule of thumb applies even to a local vector, if you ever pass it by address to another function.

Share on other sites
Why don't you just generate the assembler code and check to see what is done in each case?

Share on other sites
I know it might sound crazy, but if you are worrying about how to speed up a linear traversal of an array, you might be using the wrong kind of data structure. Algorithmic optimizations here are going to trump any assembly work you might be doing.

Just my .02c

Share on other sites
Quote:
 Original post by staticVoid2:
for(std::vector<int>::iterator i = element.begin(); i != element.end(); ++i)
{
    *i = 1;
}

Also note that it is evaluating element.end() every loop also. You might want to try:

for (std::vector<int>::iterator i = element.begin(), end = element.end();
     i != end;
     ++i)

Share on other sites
Get a 2 core machine and process 2 elements in parallel.

Then vectorize your loop so each processor processes 4 elements at a time.

Done. 8x speedup.

Share on other sites
Quote:
 Original post by staticVoid2: any reason why it takes so long in std::vector?
Because you profiled a debug build and/or did not disable iterator checking.
Only profile release builds.

I believe you're taking too narrow a view of your code in deciding that loop iteration is the problem. The time taken probably has far more to do with what is actually being done in that loop and how many times the loop executes. You need to post the real code and the profiling data so that others with more experience can interpret them.

Share on other sites
I very much doubt that your first two loops result in a noticeable speed difference. As others have said, look at the generated assembly in release mode and see what the difference is.

Share on other sites
You can't force the compiler to generate the code you want at this level of minutiae. The only possibility is to hand-code it in assembly.

Look at induction-variable analysis (index-to-iterator conversion), loop code hoisting, loop unrolling, and rewriting of loop-construct jump logic: all very standard back-end compiler passes.

And even if you did write this in assembly, in many cases the code is likely to be optimal only for a specific CPU (think of variation in code and data cache sizes), even when the architecture is the same.

Focus on algorithmic possibilities.

Share on other sites
Quote:
 Original post by Evil Steve: I very much doubt that your first two loops result in a noticeable speed difference.
100% Agreed.
Although some people do argue that arrays allow the compiler to optimize better in some cases, I've never seen any difference in real life.

Any recent and decent compiler will optimize that simple array loop using SIMD where it is legal and appropriate (given the correct compiler switches), so fiddling with assembly is not necessary.
Assembly may even hurt, because not only do you make your code unreadable and unportable, but also hand-generated assembly may very well perform worse than compiler-generated code, unless you have a lot of experience.

The only really significant speedup that you can achieve with hand-written assembly over using a decent compiler in this example is if you use NTA stores, but then again this only works out if you are positive that you won't read that data again soon. Otherwise, it'll be a false friend that makes your code some 50-100 times slower overall (because every single access will be a guaranteed cache miss).

Another remark on the std::vector version: In the particular case of vector<int>, it doesn't really matter, but in general it's better to use for_each, as it doesn't have to call and compare against element.end() a thousand times.
Given the right conditions, for_each will often be able to unroll known-size loops or sometimes even perform the entire calculation at compile time. Some compilers are occasionally successful doing the same with for() too, but it's a lot tougher job for the compiler.

Share on other sites
Here's the asm for both:

iterator:

int main()
{
    int* array = new int[10];
    for(int* i = array; i < array + 10; ++i)
    {
        *i = 0;
    }
}

_TEXT SEGMENT
$T2542 = -224                           ; size = 4
_i$2531 = -20                           ; size = 4
_array$ = -8                            ; size = 4
_main PROC                              ; COMDAT
; Line 4
    push    ebp
    mov     ebp, esp
    sub     esp, 228                    ; 000000e4H
    push    ebx
    push    esi
    push    edi
    lea     edi, DWORD PTR [ebp-228]
    mov     ecx, 57                     ; 00000039H
    mov     eax, -858993460             ; ccccccccH
    rep stosd
; Line 5
    push    40                          ; 00000028H
    call    ??2@YAPAXI@Z                ; operator new
    add     esp, 4
    mov     DWORD PTR $T2542[ebp], eax
    mov     eax, DWORD PTR $T2542[ebp]
    mov     DWORD PTR _array$[ebp], eax
; Line 13
    mov     eax, DWORD PTR _array$[ebp]
    mov     DWORD PTR _i$2531[ebp], eax
    jmp     SHORT $LN3@main
$LN2@main:
    mov     eax, DWORD PTR _i$2531[ebp]
    add     eax, 4
    mov     DWORD PTR _i$2531[ebp], eax
$LN3@main:
    mov     eax, DWORD PTR _array$[ebp]
    add     eax, 40                     ; could be avoided
    cmp     DWORD PTR _i$2531[ebp], eax
    jae     SHORT $LN4@main
; Line 15
    mov     eax, DWORD PTR _i$2531[ebp]
    mov     DWORD PTR [eax], 0
; Line 16
    jmp     SHORT $LN2@main
$LN4@main:
; Line 18
    xor     eax, eax
    pop     edi
    pop     esi
    pop     ebx
    add     esp, 228                    ; 000000e4H
    cmp     ebp, esp
    call    __RTC_CheckEsp
    mov     esp, ebp
    pop     ebp
    ret     0
_main ENDP
_TEXT ENDS
END

indexing:

int main()
{
    int* array = new int[10];
    for(int i = 0; i < 10; ++i)
    {
        array[i] = 0;
    }
}

_TEXT SEGMENT
$T2542 = -224                           ; size = 4
_i$2531 = -20                           ; size = 4
_array$ = -8                            ; size = 4
_main PROC                              ; COMDAT
; Line 4
    push    ebp
    mov     ebp, esp
    sub     esp, 228                    ; 000000e4H
    push    ebx
    push    esi
    push    edi
    lea     edi, DWORD PTR [ebp-228]
    mov     ecx, 57                     ; 00000039H
    mov     eax, -858993460             ; ccccccccH
    rep stosd
; Line 5
    push    40                          ; 00000028H
    call    ??2@YAPAXI@Z                ; operator new
    add     esp, 4
    mov     DWORD PTR $T2542[ebp], eax
    mov     eax, DWORD PTR $T2542[ebp]
    mov     DWORD PTR _array$[ebp], eax
; Line 8
    mov     DWORD PTR _i$2531[ebp], 0
    jmp     SHORT $LN3@main
$LN2@main:
    mov     eax, DWORD PTR _i$2531[ebp]
    add     eax, 1
    mov     DWORD PTR _i$2531[ebp], eax
$LN3@main:
    cmp     DWORD PTR _i$2531[ebp], 10  ; 0000000aH
    jge     SHORT $LN4@main
; Line 10
    mov     eax, DWORD PTR _i$2531[ebp]
    mov     ecx, DWORD PTR _array$[ebp]
    mov     DWORD PTR [ecx+eax*4], 0
; Line 11
    jmp     SHORT $LN2@main
$LN4@main:
; Line 18
    xor     eax, eax
    pop     edi
    pop     esi
    pop     ebx
    add     esp, 228                    ; 000000e4H
    cmp     ebp, esp
    call    __RTC_CheckEsp
    mov     esp, ebp
    pop     ebp
    ret     0
_main ENDP
_TEXT ENDS
END

You can see how the first is faster. Also, I never knew this was allowed in assembly:

    mov     DWORD PTR [ecx+eax*4], 0

Is ecx + eax * 4 not more of a high-level language construct? Should it not be:

    mul     eax, 4
    add     ecx, eax
    mov     DWORD PTR [ecx], 0

[Edited by - staticVoid2 on July 25, 2008 6:29:16 AM]

Share on other sites
Here's an example of what I was saying (using gcc, but any other modern compiler should do more or less the same).

#include <stdio.h>

int main()
{
    int x[0x1000];
    int val = 5;
    for(unsigned int i = 0; i < 0x1000; ++i)
        x[i] = val;
    printf("%d", x[0xfff]); // prevent compiler from dead-stripping everything
    return 0;
}

An -O3 build (i386 code) translates the loop to:

L2:
    movl    $5, -4(%edx,%eax,4)
    addl    $1, %eax
    cmpl    $4097, %eax
    jne     L2

-O3 -msse2 -ftree-vectorize -march=athlon64 translates to:
movdqa LC0, %xmm0
leal -8(%ebp), %edx
L2:
movaps %xmm0, (%eax)
cmpl %edx, %eax
jne L2

As you can see, it nicely vectorized the loop without any tampering at source level.
