fastest iteration through an array

Started by staticVoid2
9 comments, last by staticVoid2 15 years, 8 months ago
hi, I'm trying to improve the speed of iterating through a vector/array in my code, as I seem to be doing this a lot. The usual way I do it is with a for loop, like this:


for(int i = 0; i < size; i++)
{
   element[i] = 1;
}

but you can see the problem: a temporary address has to be computed on each iteration to access the element (effectively temp = element + i), so I tried doing it this way:

for(int* i = element; i != element + size; ++i)
{
   *i = 1;
}
which improved the speed quite a bit. The code above (I think) would be equivalent to:

for(std::vector<int>::iterator i = element.begin(); i != element.end(); ++i)
{
   *i = 1;
}
if you were using std::vector. My question is: does the conditional term get evaluated once and stored in a temporary, or would the .end() function be called on each iteration? I ran a profiler on these three loops with the same array size, and the times (in ticks) were: ~900 for indexing, ~450 for pointers, ~13000 for std::vector. Also, any reason why it takes so long with std::vector?
All of the above examples are algorithmically too slow.

Find a qword setmem algorithm, and use that to write value of 0x01010101. SIMD would work as well.

For anything that is larger than L1 cache memory access will be your biggest killer anyway.
Quote:Original post by staticVoid2
also any reason why it takes so long in std::vector?
I'll bet that you're either compiling in debug mode or with runtime iterator checking on.

By the way, forcing the compiler to generate a particular sequence of instructions or loop form is usually quite hard. But anyway, the following loop form is usually the fastest in assembly language and is what the AMD (and, IIRC, Intel) optimization guide recommends:
void iterate_over(const int *array, size_t len)
{
    ptrdiff_t i;
    if(!len) {
        return;
    }
    array += len;
    i = -len;
    do {
        process(&array[i]);
    } while(++i);
}


Of course, other factors such as cache behavior and, you know, what's actually inside the loop usually make far more difference. And if the loop overhead really is a problem (it's usually virtually free on x86, since little beyond painstakingly hand-tuned assembly code can hope to use all of the processor's execution resources), then you can always unroll the loop.

The ideal way also depends on register pressure, size of the array elements (the index method is only suitable for powers-of-two between one and eight bytes on x86/x64), and so on and so forth. Plus other architectures prefer very different code. It's a mess really so I'd just stick to the most natural form and then rewrite any code you truly need to squeeze every last cycle out of in assembly language.

The one easy and universal fix I can think of is that 32-bit indices seem to generate crappy code on x64, at least in GCC, so stick to size_t and the like.

edit: The conditional might or might not be re-evaluated every iteration, depending on whether the compiler can deduce that the array's size/origin won't change. If they're simply passed as pointer and integer arguments to a function, you're probably fine. But if you're handed a whole std::vector by reference and invoke end() every iteration, you'd probably be better off caching it. The same rule of thumb applies even to a local vector if you ever pass its address to another function.
Why don't you just generate the assembler code and check to see what is done in each case?
I know it might sound crazy, but if you are worrying about how to speed up a linear traversal of an array, you might be using the wrong kind of data structure. Algorithmic optimizations here are going to trump any assembly work you might be doing.

Just my .02c
Quote:Original post by staticVoid2

for(std::vector<int>::iterator i = element.begin(); i != element.end(); ++i)
{
   *i = 1;
}



Also note that it is evaluating element.end() on every iteration too. You might want to try:

for (std::vector<int>::iterator i = element.begin(), end = element.end();
     i != end;
     ++i)


Get a 2 core machine and process 2 elements in parallel.

Then vectorize your loop so each processor processes 4 elements at a time.

Done. 8x speedup.
Quote:Original post by staticVoid2
any reason why it takes so long in std::vector?
Because you profiled a debug build and/or did not disable iterator checking.
Only profile release builds.

I believe you're taking too narrow a view of your code in deciding that loop iteration is the problem. The time taken probably has far more to do with what is actually being done inside the loop, and how many times the loop is executed. You need to post the real code and the profiling data, so that others with more experience can interpret them.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
I very much doubt that your first two loops result in a noticeable speed difference. As others have said, look at the generated assembly in release mode and see what the difference is.
you can't force the compiler to generate the code you want at this level of minutiae. The only possibility is to hand-code it in assembler.

look at induction variable analysis (index-to-iterator conversion), loop code hoisting, loop unrolling, and rewriting of loop-construct jump logic - all very standard back-end compiler passes.

and if you did write this in assembly in many cases the code is likely only optimal for a specific cpu (think variation in code and data cache sizes) even if the architecture is the same.

focus on algorithmic possibilities.

This topic is closed to new replies.
