staticVoid2

fastest iteration through an array


hi, I'm trying to improve the speed of iterating through a vector/array in my code, as I seem to be doing this a lot. The usual way I do it is with a for loop like this:

for(int i = 0; i < size; i++)
{
   element[i] = 1;
}

but you can see the problem: the element's address has to be computed on each iteration (temp = element + i), so I tried doing it this way:
for(int* i = element; i != element + size; ++i)
{
   *i = 1;
}
which improved the speed quite a bit. The code above would (I think) be equivalent to:
for(std::vector<int>::iterator i = element.begin(); i != element.end(); ++i)
{
   *i = 1;
}
if you were using std::vector. My question is: does the conditional get evaluated once and stored in a temporary, or would the .end() function be called on each iteration? I ran a profiler on these three loops with the same array size; the times (in ticks) were:

indexing: ~900
pointers: ~450
std::vector: ~13000

Also, any reason why it takes so long with std::vector?

All of the above examples are algorithmically too slow.

Find a qword setmem routine and use it to write the value 0x01010101. SIMD would work as well.

For anything larger than the L1 cache, memory access will be your biggest cost anyway.

Quote:
Original post by staticVoid2
also any reason why it takes so long in std::vector?
I'll bet that you're either compiling in debug mode or with runtime iterator checking on.

By the way, forcing the compiler to generate a particular sequence of instructions or loop form is usually quite hard. Anyway, the following loop form is usually the fastest in assembly language, and it's what the AMD (and IIRC Intel) optimization guide recommends:
void iterate_over(const int *array, size_t len) {
    ptrdiff_t i;

    if(!len) {
        return;
    }

    array += len;
    i = -(ptrdiff_t)len;  // negative index counts up toward zero

    do {
        process(array[i]);
    } while(++i);
}


Of course other factors such as cache behavior and, you know, what's actually inside the loop usually make far more difference. And if the loop overhead really is a problem (and it's usually virtually free on x86, since little beyond painstakingly hand-tuned assembly code can hope to use all of the processor's execution resources), you can always unroll the loop.

The ideal way also depends on register pressure, size of the array elements (the index method is only suitable for powers-of-two between one and eight bytes on x86/x64), and so on and so forth. Plus other architectures prefer very different code. It's a mess really so I'd just stick to the most natural form and then rewrite any code you truly need to squeeze every last cycle out of in assembly language.

The one easy and universal fix I can think of is that 32-bit indices seem to generate crappy code on x64, at least in GCC, so stick to size_t and the like.

edit: The conditional might or might not be re-evaluated every iteration, depending on whether the compiler can deduce that the array's size/origin won't change. If they're simply passed as pointer and integer arguments to a function, you're probably fine. But if you're being sent a whole std::vector by reference and invoking end() every iteration, you'd probably be better off caching it; the same rule of thumb applies even to a local vector if you ever pass it by address to another function.

I know it might sound crazy, but if you are worrying about how to speed up a linear traversal of an array, you might be using the wrong kind of data structure. Algorithmic optimizations here are going to trump any assembly work you might be doing.

Just my .02c

Quote:
Original post by staticVoid2

for(std::vector<int>::iterator i = element.begin(); i != element.end(); ++i)
{
    *i = 1;
}



Also note that it evaluates element.end() on every iteration too. You might want to try:


for(std::vector<int>::iterator i = element.begin(), end = element.end();
    i != end;
    ++i)
{
    *i = 1;
}



Quote:
Original post by staticVoid2
any reason why it takes so long in std::vector?
Because you profiled a debug build and/or did not disable iterator checking.
Only profile release builds.

I believe you're taking too narrow a view of your code in deciding that loop iteration is the problem. The time taken probably has far more to do with what is actually being done inside the loop, and how many times the loop executes. You should post the real code along with the profiling data, so that others with more experience can interpret it.

I very much doubt that your first two loops result in a noticeable speed difference. As others have said, look at the generated assembly in release mode and see what the difference is.

You can't force the compiler to generate the code you want at this level of minutiae. The only way is to hand-code it in assembler.

Look at induction variable analysis (index-to-iterator conversion), loop code hoisting, loop unrolling, and rewriting of loop-construct jump logic - all very standard back-end compiler passes.

And even if you did write this in assembly, in many cases the code would only be optimal for a specific CPU (think of variations in code and data cache sizes), even within the same architecture.

Focus on algorithmic possibilities.

Quote:
Original post by Evil Steve
I very much doubt that your first two loops result in a noticeable speed difference.
100% Agreed.
Although some people do argue that arrays allow the compiler to optimize better in some cases, I've never seen any difference in real life.

Any recent and decent compiler will optimize that simple array loop using SIMD where it is legal and appropriate (given the correct compiler switches), so fiddling with assembly is not necessary.
Assembly may even hurt, because not only do you make your code unreadable and unportable, but also hand-generated assembly may very well perform worse than compiler-generated code, unless you have a lot of experience.

The only really significant speedup that you can achieve with hand-written assembly over using a decent compiler in this example is if you use NTA stores, but then again this only works out if you are positive that you won't read that data again soon. Otherwise, it'll be a false friend that makes your code some 50-100 times slower overall (because every single access will be a guaranteed cache miss).

Another remark on the std::vector version: In the particular case of vector<int>, it doesn't really matter, but in general it's better to use for_each, as it doesn't have to call and compare against element.end() a thousand times.
Given the right conditions, for_each will often be able to unroll known-size loops or sometimes even perform the entire calculation at compile time. Some compilers are occasionally successful doing the same with for() too, but it's a lot tougher job for the compiler.

Here's the asm for both:

iterator:

int main()
{
    int* array = new int[10];

    for(int* i = array; i < array + 10; ++i)
    {
        *i = 0;
    }
}

_TEXT SEGMENT
$T2542 = -224 ; size = 4
_i$2531 = -20 ; size = 4
_array$ = -8 ; size = 4
_main PROC ; COMDAT
; Line 4
push ebp
mov ebp, esp
sub esp, 228 ; 000000e4H
push ebx
push esi
push edi
lea edi, DWORD PTR [ebp-228]
mov ecx, 57 ; 00000039H
mov eax, -858993460 ; ccccccccH
rep stosd
; Line 5
push 40 ; 00000028H
call ??2@YAPAXI@Z ; operator new
add esp, 4
mov DWORD PTR $T2542[ebp], eax
mov eax, DWORD PTR $T2542[ebp]
mov DWORD PTR _array$[ebp], eax
; Line 13
mov eax, DWORD PTR _array$[ebp]
mov DWORD PTR _i$2531[ebp], eax
jmp SHORT $LN3@main
$LN2@main:
mov eax, DWORD PTR _i$2531[ebp]
add eax, 4
mov DWORD PTR _i$2531[ebp], eax
$LN3@main:
mov eax, DWORD PTR _array$[ebp]
add eax, 40 ; could be avoided
cmp DWORD PTR _i$2531[ebp], eax
jae SHORT $LN4@main
; Line 15
mov eax, DWORD PTR _i$2531[ebp]
mov DWORD PTR [eax], 0
; Line 16
jmp SHORT $LN2@main
$LN4@main:
; Line 18
xor eax, eax
pop edi
pop esi
pop ebx
add esp, 228 ; 000000e4H
cmp ebp, esp
call __RTC_CheckEsp
mov esp, ebp
pop ebp
ret 0
_main ENDP
_TEXT ENDS
END

indexing:

int main()
{
    int* array = new int[10];

    for(int i = 0; i < 10; ++i)
    {
        array[i] = 0;
    }
}

_TEXT SEGMENT
$T2542 = -224 ; size = 4
_i$2531 = -20 ; size = 4
_array$ = -8 ; size = 4
_main PROC ; COMDAT
; Line 4
push ebp
mov ebp, esp
sub esp, 228 ; 000000e4H
push ebx
push esi
push edi
lea edi, DWORD PTR [ebp-228]
mov ecx, 57 ; 00000039H
mov eax, -858993460 ; ccccccccH
rep stosd
; Line 5
push 40 ; 00000028H
call ??2@YAPAXI@Z ; operator new
add esp, 4
mov DWORD PTR $T2542[ebp], eax
mov eax, DWORD PTR $T2542[ebp]
mov DWORD PTR _array$[ebp], eax
; Line 8
mov DWORD PTR _i$2531[ebp], 0
jmp SHORT $LN3@main
$LN2@main:
mov eax, DWORD PTR _i$2531[ebp]
add eax, 1
mov DWORD PTR _i$2531[ebp], eax
$LN3@main:
cmp DWORD PTR _i$2531[ebp], 10 ; 0000000aH
jge SHORT $LN4@main
; Line 10
mov eax, DWORD PTR _i$2531[ebp]
mov ecx, DWORD PTR _array$[ebp]
mov DWORD PTR [ecx+eax*4], 0
; Line 11
jmp SHORT $LN2@main
$LN4@main:
; Line 18
xor eax, eax
pop edi
pop esi
pop ebx
add esp, 228 ; 000000e4H
cmp ebp, esp
call __RTC_CheckEsp
mov esp, ebp
pop ebp
ret 0
_main ENDP
_TEXT ENDS
END

You can see how the first is faster. Also, I never knew this was allowed in assembly:

mov DWORD PTR [ecx+eax*4], 0

Is this not more of a high-level language construct? For ecx + eax * 4, should it not be:

mul eax, 4
add ecx, eax
mov DWORD PTR [ecx], 0

[Edited by - staticVoid2 on July 25, 2008 6:29:16 AM]

Here's an example of what I was saying (using gcc, but any other modern compiler should do more or less the same).

#include <stdio.h>

int main()
{
    int x[0x1000];
    int val = 5;
    for(unsigned int i = 0; i < 0x1000; ++i)
        x[i] = val;

    printf("%d", x[0xfff]); // prevent compiler from dead-stripping everything
    return 0;
}


-O3 build (i386 code) translates the loop to:
L2:
    movl $5, -4(%edx,%eax,4)
    addl $1, %eax
    cmpl $4097, %eax
    jne L2


-O3 -msse2 -ftree-vectorize -march=athlon64 translates to:
    movdqa LC0, %xmm0
    leal -8(%ebp), %edx
L2:
    movaps %xmm0, (%eax)
    addl $16, %eax
    cmpl %edx, %eax
    jne L2

As you can see, it nicely vectorized the loop without any tampering at source level.
