# Odd Performance Question

## Recommended Posts

I was researching how I could improve performance and came across a strange claim. The straightforward version:

    double list[100], sum = 0;
    for (int i = 0; i < 100; i++)
        sum += list[i];

is supposedly slower than this one:

    double list[100], sum1 = 0, sum2 = 0;
    for (int i = 0; i < 100; i += 2)
    {
        sum1 += list[i];
        sum2 += list[i+1];
    }
    sum1 += sum2;

Is this true? I'm just wondering for fun. Thanks.

##### Share on other sites
I don't think so. It depends on how the compiler and hardware handle the code.

Looking at it at a basic level:

{
sum += list[i]; // 1 Addition + 1 Pointer Redirect
}

200 Additions + 100 Mem Grabs

Against
{
sum1 += list[i];// 1 Addition + 1 Pointer Redirect
sum2 += list[i+1];// 2 Addition + 1 Pointer Redirect
}

sum1 += sum2; // Single addition

201 Additions + 100 Mem Grabs

///////////////
In case 1, the CPU must wait for the result of list[i] before finishing the iteration (otherwise i would change out from under it), whereas in case 2 the CPU can issue the second memory read while still waiting for the first. In effect, the second version's two reads may cost only one memory access's worth of waiting.

I read somewhere a long time ago that if consecutive instructions are not interdependent, the CPU can load them into the cache (buffer) at the same time, which saves memory transfer time. If that's true, then case 2 is essentially 201 adds + 50 mem grabs.

##### Share on other sites
Congratulations, you have just re-discovered loop-unrolling. There is some small per-iteration overhead when processing a loop. Your second bit of code has half as many iterations and thus has less overhead. The penalty you pay for this is increased code size and (often) harder to understand code.

You can either do it manually like you've done here or your compiler may have options that will allow it to do this automatically to a certain extent.

As always, be sure to profile.

##### Share on other sites
Thanks. I always thought of loop-unrolling as x += SomeArray[1]; ... x += SomeArray[N];. What do you use to profile your code? I'm looking for something that is free and works with Windows apps (assuming one exists).

##### Share on other sites
You guys are way off. It's an assembly/pipeline issue. You have to understand the pipeline. There are five steps (I think) to issuing an instruction. If you use a variable while it is still in the pipeline, the pipeline has to restart to make sure it doesn't mess things up.

a += b;
a += c;
a += d;
...
The pipeline has to restart each time "a" changes because "a" is used again.

In the second example:
a += b;
b += c;
a += d;
b += e;
a += b;
This is double-streamed. The pipeline does not have to restart because each line does not interfere with the previous line's variables (except for at the end when they are combined).

This is hardcore assembly optimization and will not amount to anything reasonable. The most important use for this type of double-streaming is for multiplication. Multiplication takes much longer to execute (like 30 times longer than an add). This ensures that the pipeline has time to execute the previous statement... I'm getting to the point where I don't know what I'm talking about. Look up how the pipeline works. I learned a lot of that stuff when I was programming assembly.

##### Share on other sites
Quote:
 Original post by blaze02You guys are way off. [irrelevance follows]

Um, no. For one thing, you sort of combined pairing and pipelining -- pipelining is allowing a multi-clock instruction, like an integer multiplication (3 clocks), to run its 2nd and 3rd clocks in the background while new instructions are fetched.

For another, that has nothing to do with this post. Loop unrolling is what this is about, and the speed benefit is due to the reduction of conditional jumps. A conditional jump is when the code jumps to a different location depending on the value of something. 'if' statements are conditional jumps, as are 'while' statements, and virtual functions and function pointers result in the same problem. The reason this is slow is that the program does not know which code will be executed until it has a chance to evaluate the condition, and many optimizations are done behind the scenes (such as caching) that depend on the CPU knowing exactly which code will be executed in order. Non-conditional jumps, such as gotos and infinite loops, are perfectly fast.

It's easier to explain with an example. Say we have this C++ code:
for( int N = 0; N < Count; N ++ )    Sum *= Sum;

Here's equivalent assembly:

              mov eax, Sum   // EAX stores Sum
              mov ecx, 0     // ECX acts as 'N', initially set to 0
              mov ebx, Count // EBX stores Count
    for_loop: cmp ecx, ebx   // Compare N to Count
              jge done       // CONDITIONAL JUMP. If N >= Count, go to 'done'
              imul eax       // multiply Sum with itself
              inc ecx        // Increment N
              jmp for_loop   // NONCONDITIONAL JUMP back to the for loop
    done:     mov Sum, eax   // Copy EAX back into Sum

That 'jge done' is what's called a conditional jump. It's slow and stalls the CPU.

##### Share on other sites
Quote:
 Original post by CGameProgrammer
[...] Here's equivalent assembly:

              mov eax, Sum   // EAX stores Sum
              mov ecx, 0     // ECX acts as 'N', initially set to 0
              mov edx, Count // EDX stores Count
    for_loop: cmp ecx, edx   // Compare N to Count
              jge done       // CONDITIONAL JUMP. If N >= Count, go to 'done'
              imul eax       // multiply Sum with itself
              inc ecx        // Increment N
              jmp for_loop   // NONCONDITIONAL JUMP back to the for loop
    done:     mov Sum, eax   // Copy EAX back into Sum

[...]
Not quite. 'imul' stores a result double the bit width of the operand, so when you multiply by the 32-bit 'eax' register, the result goes into the 64-bit 'edx:eax' register pair, which overwrites 'count'. Change all instances of edx to ebx and it should be good.

##### Share on other sites
Quote:
 Original post by ExtrariusNot quite. 'imul' stores a result double the bit width of the operand, so when you multiply by the 32-bit 'eax' register, the result goes into the 64-bit 'edx:eax' register pair, which overwrites 'count'. Change all instances of edx to ebx and it should be good.

Oh yeah, I forgot. (looks sheepish) It's been a while since I wrote optimized assembly.

##### Share on other sites
Quote:
 Original post by blaze02You guys are way off. Its an assembly/pipeline issue.

Nope, and certainly not on modern processors. The single pipeline you describe is what most RISC CPUs (MIPS, Alpha, SPARC, etc.) and the P5 (Pentium) were like. Starting with the P6 (Pentium Pro, Pentium II, and on up), processors execute instructions out of order, and there is no single clear "pipeline." More precisely, the processor has several pipelines (a few integer pipelines, a few floating-point pipelines, etc.) that execute concurrently.

In fact, due to the pipelined execution units, out-of-order execution, and, most importantly, register renaming, modern CPUs can essentially unroll loops on the fly.

The only thing that may hurt is, as the posters immediately above me mentioned, the branches. The CPU must predict whether each branch is taken or not; if it predicts wrong, it does a lot of useless work and wastes a lot of time. It's this speculative execution that makes mispredictions costly.

Modern branch prediction, however, is incredibly good (generally on the order of 96-97+% accurate). In a loop like this with 100 iterations, the CPU will certainly mispredict the final branch (when the loop exits) and may mispredict the very first one. It will get all the other branches correct and unroll the loop dynamically at run time.

So, in short, those two pieces of code will execute in the same time on a modern CPU. In fact, the first one is probably quicker as it gets rid of the final addition. But we're talking a few clock cycles here.

##### Share on other sites
Quote:
 Original post by penwanIn fact, due to the pipelined execution units, out-of-order execution, and, most importantly, register renaming, modern CPUs can essentially unroll loops on the fly.

Yes, but with a lot of restrictions. Compilers often have a hard time figuring out how much it's safe to unroll a loop, because of aliasing and various dependencies.

Anyway, that's beside the point. There certainly *is* a noticeable performance bottleneck in the original version, and it's not the branches. (They might add a handful of cycles because it'll guess wrong in the last and maybe the first iteration)

Floating-point addition has a latency of 4+ cycles on modern PCs.
So always adding to the same variable (sum) means you have to wait 4 cycles between each addition. You can't add anything to a variable before you know its value, not even with pipelining and out-of-order CPUs. So you waste time.
The second version should be quite a bit faster, since it stores the partial results in two independent variables. Hence, you don't need to wait for the previous addition to complete before you start on the following one.
You would probably gain a bit more (but not much more) by using 3 or even 4 such temporaries.
Ideally, you'd want one addition to retire every cycle (or two, if your CPU can handle that; I know the Athlon 64 can't, and I doubt the P4 can either).

In fact, AMD has a similar example in their optimization guide. The only difference is they use multiplication instead of addition, and they say to perform 4 at a time, to match the 4-cycle latency. But the principle is the same. Allow it to perform an operation every cycle, without having to wait for the previous result to complete.

And sheesh, I'm surprised everyone (except possibly blaze02) missed this. [wink]
Screw the branching issues. It doesn't take a complex branch predictor to handle something like this loop. What matters is the data dependencies that prevent it from pipelining efficiently.

##### Share on other sites
Quote:
 Original post by SpoonbenderAnd sheesh, I'm surprised everyone (except possibly blaze02) missed this. [wink]Screw the branching issues. It doesn't take a complex branch predictor to handle something like this loop. What matters is the data dependencies that prevent it from pipelining efficiently.

Most of my experience has been with 486 assembly. I haven't read up on the pentium+ pipelines. Yeah, for single loops, the compiler will "guess" the correct next operation like 99% of the time. In fact I wouldn't worry about anything this low level ever; or you might as well be writing assembly code. You are talking 1-2 microseconds difference (AT BEST!) with this optimization. The reason games slow down is an entirely different issue. Spend your time learning what graphic render states take the longest to swap in and out and group accordingly. If you plan on handling a ton of objects, look into the possibility of a hash_map instead of a vector or list. These types of speed-ups you will really be able to see and use.

##### Share on other sites
Quote:
Original post by CGameProgrammer
Quote:
 Original post by blaze02You guys are way off. [irrelevance follows]

Um, no. For one thing, you sort of combined pairing and pipelining -- pipelining is allowing a multi-clock instruction, like an integer multiplication (3 clocks), to run its 2nd and 3rd clocks in the background while new instructions are fetched.

HAHAHA, please somebody tell me this is wrong. The multiply opcode took 30+ clock cycles on the 486. No way it's down to "3 clocks". And don't ever quote me as saying [irrelevance follows].

##### Share on other sites
Bah, people are always making this stuff more complex than it has to be. "Loop unrolling" is all you really need to think about. All that minuscule-level processor gibberish merely becomes more efficient because of the unrolling.

##### Share on other sites
Quote:
Original post by blaze02
Quote:
 Original post by SpoonbenderAnd sheesh, I'm surprised everyone (except possibly blaze02) missed this. [wink]Screw the branching issues. It doesn't take a complex branch predictor to handle something like this loop. What matters is the data dependencies that prevent it from pipelining efficiently.

Most of my experience has been with 486 assembly. I haven't read up on the pentium+ pipelines. Yeah, for single loops, the compiler will "guess" the correct next operation like 99% of the time. In fact I wouldn't worry about anything this low level ever; or you might as well be writing assembly code. You are talking 1-2 microseconds difference (AT BEST!) with this optimization. The reason games slow down is an entirely different issue. Spend your time learning what graphic render states take the longest to swap in and out and group accordingly. If you plan on handling a ton of objects, look into the possibility of a hash_map instead of a vector or list. These types of speed-ups you will really be able to see and use.

Well, this one might be noticeable if you run it in a really tight inner loop. (If we're talking 100 iterations, you won't be able to measure a difference. If this is where 90% of your app's execution time is spent, I can assure you the second version will be a hell of a lot faster than the first) [wink]

##### Share on other sites
Quote:
Original post by blaze02
Quote:
Original post by CGameProgrammer
Quote:
 Original post by blaze02You guys are way off. [irrelevance follows]

Um, no. For one thing, you sort of combined pairing and pipelining -- pipelining is allowing a multi-clock instruction, like an integer multiplication (3 clocks), to run its 2nd and 3rd clocks in the background while new instructions are fetched.

HAHAHA, please somebody tell me this is wrong. The multiply opcode took 30+ clock cycles in 486. No way its down to "3 clocks".

You'll do well to read the Intel PDF titled "Intel Architecture Optimization", specifically Appendix C.

##### Share on other sites
Quote:
 Original post by blaze02HAHAHA, please somebody tell me this is wrong. The multiply opcode took 30+ clock cycles in 486. No way its down to "3 clocks". And don't ever quote me as saying [irrelevance follows].

I remember it being exactly 3 clocks for the Pentium II and above. An integer division is around 20-30 clocks IIRC. Yes, it was really slow on a 486, but modern processors are not overclocked 486s.

##### Share on other sites
Oh man. The amount of disinformation here is worrisome. Thanks to Spoonbender for finally getting it right :)

Quote:
 speed benefit is due to the reduction of conditional jumps. [..] The reason this is slow is that the program does not know which code will be executed until it has a chance to evaluate the condition

Not true. Correctly predicted conditional jumps can be considered to not have *any* overhead at all (due to Branch Target Buffer). It's mispredicting a jump that is the killer.

Quote:
 The only thing that may hurt is, as the posters immediately above me mentioned, the branches.

Noo! We are talking about a loop, which is well-predicted.

Quote:
 please somebody tell me this is wrong. The multiply opcode took 30+ clock cycles in 486. No way its down to "3 clocks".

It is indeed. The FPU used to be an external coprocessor (through the 386, and absent from the 486SX); it has been integrated on-die since the 486DX and Pentium, and has been further improved since (several execution units added).

Quote:
 Bah, people are always making this stuff more complex than it has to be. "loop unrolling" is all you really need to think about. All that minuscule-level processor giberish merely becomes more efficient because of the unrolling.

Nope, plain loop unrolling was relevant several years ago. The thing to think about nowadays is "software pipelining"; a Google search turns up a nice intro article.
In short, it's about minimizing dependencies so you can make use of all the execution units in your N-way superscalar CPU and hide latency. The example above is a very simple case of this.

Oh, and this "minuscule-level processor gibberish" can result in a 2.5x speed improvement (quoted in the above article). I'll take it!

##### Share on other sites
Quote:
 Original post by Anon MikeBah, people are always making this stuff more complex than it has to be. "loop unrolling" is all you really need to think about. All that minuscule-level processor giberish merely becomes more efficient because of the unrolling.

Er no. Loop unrolling is exactly what doesn't make much difference.
"Bah, people are always assuming cpu performance is simpler than it is." [wink]
You can unroll the loop as much as you like, and you won't see more than a few percent improvement in performance.
What the OP actually asked about might give you 50-75% extra performance, at a rough estimate.

Quote:
 HAHAHA, please somebody tell me this is wrong. The multiply opcode took 30+ clock cycles in 486. No way its down to "3 clocks".

Sounds about right. The Athlon 64 has a 4-cycle latency for FP multiply. Keep in mind, this is just the latency before the result is available to other instructions; the entire instruction, from fetch to retire, takes quite a bit longer.
And FP math has improved quite a bit since the 486 days (in-order execution, and on earlier chips an external FPU).

Quote:
 Oh, and this "minuscule-level processor giberish" can result in 2.5x speed improvement (quoted in above article). I'll take it!

You can get much more than 2.5x in some circumstances.
I did a project at uni last semester where I ended up with around a 10x speed improvement, purely from "minuscule gibberish" like this.
All it takes is a good understanding of CPU architecture and performance.

##### Share on other sites
Quote:
 Original post by blaze02And don't ever quote me as saying [irrelevance follows].

Don't take this the wrong way. What are you going to do about it?

##### Share on other sites
Thanks for all the replies. It got me thinking about other things that I can do to increase my performance.

P.S. My performance bottleneck is not inside my render routines. I simply use OpenGL to draw my data set; calculating the data set is the hard part. It would normally be done on an FPGA, but I have to use a laptop to do the calcs. I don't think what I'm doing would have been possible a few years ago, but modern CPUs have advanced to the point where it can be done; just not easily.

Thanks again.

##### Share on other sites
How is your rendering loop structured?

Doing:

render
update
swapbuffers

can be dramatically faster than the alternatives.

##### Share on other sites
render
update
swapbuffers

Yes, I use this pattern. However, it's the update part that kills me. Using static data just to test the render, I get 0-2% CPU utilization.

##### Share on other sites
Got a chance to fully test my code again. I found a peak CPU utilization of 11% with the loop unrolled and a peak of 30% without it (using a nested for loop). It makes a HUGE difference in performance. My code is a little bloated, but I got the performance I wanted (I'd like more increases, of course).

##### Share on other sites
Quote:
Original post by Spoonbender
Quote:
 Original post by Anon MikeBah, people are always making this stuff more complex than it has to be. "loop unrolling" is all you really need to think about. All that minuscule-level processor giberish merely becomes more efficient because of the unrolling.

Er no. Loop unrolling is exactly what doesn't make much difference.
"Bah, people are always assuming cpu performance is simpler than it is." [wink]
You can unroll the loop as much as you like, and you won't see more than a few percent improvement in performance.

I think you misunderstood my point. I'm not denying that pipelining, branch prediction, vectorized operations, yadda yadda yadda make a significant difference. My comments were directed at the original code, which was a simple instance of loop unrolling; doing this in turn allowed more opportunities for the compiler and the processor to do their stuff automagically. I.e., I was trying to gently push the newbies rather than tossing them off the deep end. If their resulting mental model of what's going on isn't quite perfect, that IMHO is OK - they can ask more sophisticated questions when they're ready.

Quote:
 Oh, and this "minuscule-level processor giberish" can result in 2.5x speed improvement (quoted in above article). I'll take it!

Meh, I'll take shorter, simpler, easier to understand & maintain code any day unless it's proven that a particular piece of code is in the critical path.

##### Share on other sites
Quote:
 Original post by Anon MikeI think you misunderstood my point. I'm not denying that pipelining, branch prediction, vectorized operations, yadda yadda yadda make a significant difference. My comments were directed at the original code which was a simple instance of loop unrolling and doing this in turn allowed more opportunities for the compiler and the processor to do their stuff automagically.

Nope, I didn't miss your point. It's not loop unrolling though. Take a look at the code again. [smile]

This would be loop unrolling:
Quote:
 for (int i = 0; i < 100; i += 2)
 {
     sum += list[i];
     sum += list[i+1];
 }

and it wouldn't have made a noticeable difference. What made the difference was storing the result into two different variables.

Quote:
 Oh, and this "minuscule-level processor giberish" can result in 2.5x speed improvement (quoted in above article). I'll take it!

Quote:
 Meh, I'll take shorter, simpler, easier to understand & maintain code any day unless it's proven that a particular piece of code is in the critical path.

Well, technically it is in the critical path as long as it's executed at some point... But it might not be where most of the execution time is being wasted. And you're right, of course. It's that little "unless" that matters.

fathom88: What does your program do? Sounds like it spends a lot of time in loops like these. What's it for? [wink]

And what do you mean when you talk about CPU utilization? Are we talking about the CPU usage reported in Task Manager, or something like it?
It should be at 100% unless you do something silly like calling Sleep().
