fathom88

Odd Performance Question


Recommended Posts

I was researching how I could improve performance and came across a strange claim. Take this loop:

double list[100], sum = 0;

for (int i = 0; i < 100; i++)
    sum += list[i];

Supposedly this version is faster:

double list[100], sum1 = 0, sum2 = 0;

for (int i = 0; i < 100; i += 2)
{
    sum1 += list[i];
    sum2 += list[i + 1];
}
sum1 += sum2;

Is this true? I'm just wondering for fun. Thanks.

I don't think so. It depends on how the compiler/hardware handles the code.

Looking at the basic operations, case 1 is:

i++; // 1 Addition
{
    sum += list[i]; // 1 Addition + 1 Pointer Redirect
}

200 Additions + 100 Mem Grabs

Against case 2:

i += 2; // 1 Addition
{
    sum1 += list[i];     // 1 Addition + 1 Pointer Redirect
    sum2 += list[i + 1]; // 2 Additions + 1 Pointer Redirect
}

sum1 += sum2; // Single addition

201 Additions + 100 Mem Grabs


///////////////
If, in case 1, the CPU/Hardware/OS has to wait for the result of list[i] before advancing the loop (which it must, since sum depends on it), while in case 2 it can start another memory query while still waiting for the previous one, then the two grabs per iteration in case 2 may effectively resolve into the cost of a single pointer redirect.

I read somewhere a long time ago that if consecutive commands are not interdependent, the CPU can actually load them into the cache (buffer) at the same time, and that saves memory transfer time. If that's true, then case 2 is essentially 201 Adds + 50 mem grabs.

Share this post


Link to post
Share on other sites
Congratulations, you have just rediscovered loop unrolling. There is some small per-iteration overhead when processing a loop. Your second bit of code has half as many iterations and thus less overhead. The penalty you pay for this is increased code size and (often) harder-to-understand code.

You can either do it manually, like you've done here, or your compiler may have options that allow it to do this automatically to a certain extent.

As always, be sure to profile.
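If nothing else, a crude timing loop will answer the original question on your machine. Here's a minimal sketch of one (my own illustration, not a real profiler; clock() is coarse, so run many repetitions):

#include <cstdio>
#include <ctime>

double sum_simple(const double* list, int n)
{
    double sum = 0;
    for (int i = 0; i < n; i++)
        sum += list[i];
    return sum;
}

double sum_unrolled(const double* list, int n) // assumes n is even
{
    double sum1 = 0, sum2 = 0;
    for (int i = 0; i < n; i += 2)
    {
        sum1 += list[i];
        sum2 += list[i + 1];
    }
    return sum1 + sum2;
}

int main()
{
    static double list[100];
    for (int i = 0; i < 100; i++)
        list[i] = i * 0.5;

    volatile double sink = 0; // keeps the optimizer from discarding the work
    const int reps = 1000000;

    std::clock_t t0 = std::clock();
    for (int r = 0; r < reps; r++)
        sink = sum_simple(list, 100);
    std::clock_t t1 = std::clock();
    for (int r = 0; r < reps; r++)
        sink = sum_unrolled(list, 100);
    std::clock_t t2 = std::clock();

    std::printf("simple:   %.3f s\n", double(t1 - t0) / CLOCKS_PER_SEC);
    std::printf("unrolled: %.3f s\n", double(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}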

Share this post


Link to post
Share on other sites
Thanks. What do you use to profile your code? I'm looking for something that is free and runs with Windows apps (assuming one exists). Also, I always thought of loop unrolling as flattening the loop out entirely: x += SomeArray[1]; ...; x += SomeArray[N].
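For instance, something like this tiny made-up example (zero-based this time, and only possible when the count is fixed at compile time):

double SomeArray[4] = { 1.0, 2.0, 3.0, 4.0 };
double x = 0;

// fully unrolled: no loop counter, no compare, no jump at all
x += SomeArray[0];
x += SomeArray[1];
x += SomeArray[2];
x += SomeArray[3];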

You guys are way off. It's an assembly/pipeline issue. You have to understand the pipeline. There are 5 steps (I think) to issuing a command. If you use a variable while it is in the pipeline, you have to restart the pipeline to make sure it doesn't mess things up.

In your first example, single streaming the adds:
a += b;
a += c;
a += d;
...
The pipeline has to restart each time "a" changes because "a" is used again.

In the second example:
a += b;
b += c;
a += d;
b += e;
a += b;
This is double-streamed. The pipeline does not have to restart because each line does not interfere with the previous line's variables (except at the end, when they are combined).

This is hardcore assembly optimization and will not amount to anything significant in most code. The most important use for this type of double-streaming is multiplication. Multiplication takes much longer to execute (something like 30 times longer than an add), and double-streaming ensures that the pipeline has time to finish the previous statement... I'm getting to the point where I don't know what I'm talking about. Look up how the pipeline works; I learned a lot of this stuff when I was programming assembly.
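In C++ terms, the multiplication version of that double-streaming idea looks roughly like this (a sketch of the concept only; whether it helps depends entirely on the compiler and CPU):

// Two independent product chains: the second multiply doesn't have to
// wait for the first, so the hardware can overlap them.
double list[100];
for (int i = 0; i < 100; i++)
    list[i] = 1.0 + i * 0.001; // stand-in data, kept near 1.0 to avoid overflow

double prodA = 1.0, prodB = 1.0;
for (int i = 0; i < 100; i += 2)
{
    prodA *= list[i];
    prodB *= list[i + 1];
}
double prod = prodA * prodB; // combine the two streams at the end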

Quote:
Original post by blaze02
You guys are way off. [irrelevance follows]

Um, no. For one thing, you've sort of combined pairing and pipelining: pipelining allows a multi-clock instruction, like an integer multiplication (3 clocks), to run its 2nd and 3rd clocks in the background while new instructions are fetched.

For another, that has nothing to do with this post. Loop unrolling is what this is about, and the speed benefit is due to the reduction of conditional jumps. A conditional jump is when the code jumps to a different location depending on the value of something. 'if' statements are conditional jumps, as are 'while' statements, and virtual functions and function pointers result in the same problem. The reason this is slow is that the program does not know which code will be executed until it has a chance to evaluate the condition, and many optimizations are done behind the scenes (such as caching) that depend on the CPU knowing exactly which code will be executed in order. Non-conditional jumps, such as gotos and infinite loops, are perfectly fast.

It's easier to explain with an example. Say we have this C++ code:

for( int N = 0; N < Count; N++ )
    Sum *= Sum;


Here's equivalent assembly:


mov eax, Sum // EAX stores Sum
mov ecx, 0 // ECX acts as 'N', initially set to 0
mov ebx, Count // EBX stores Count

for_loop: cmp ecx, ebx // Compare N to Count
jge done // CONDITIONAL JUMP. If N >= Count, go to 'done'

imul eax // multiply Sum with itself

inc ecx // Increment N
jmp for_loop // NONCONDITIONAL JUMP back to the for loop

done: mov Sum, eax // Copy EAX back into Sum


That 'jge done' is what's called a conditional jump. It's slow and stalls the CPU.

EDIT: because Extrarius made me.

Quote:
Original post by CGameProgrammer
[...]Here's equivalent assembly:


mov eax, Sum // EAX stores Sum
mov ecx, 0 // ECX acts as 'N', initially set to 0
mov edx, Count // EDX stores Count

for_loop: cmp ecx, edx // Compare N to Count
jge done // CONDITIONAL JUMP. If N >= Count, go to 'done'

imul eax // multiply Sum with itself

inc ecx // Increment N
jmp for_loop // NONCONDITIONAL JUMP back to the for loop

done: mov Sum, eax // Copy EAX back into Sum
[...]
Not quite. 'imul' stores a result double the bit width of the operand, so when you multiply by the 32-bit 'eax' register, the result goes into the 64-bit 'edx:eax' register pair, which overwrites 'count'. Change all instances of edx to ebx and it should be good.

Quote:
Original post by Extrarius
Not quite. 'imul' stores a result double the bit width of the operand, so when you multiply by the 32-bit 'eax' register, the result goes into the 64-bit 'edx:eax' register pair, which overwrites 'count'. Change all instances of edx to ebx and it should be good.

Oh yeah, I forgot. (looks sheepish) It's been a while since I wrote optimized assembly.

Quote:
Original post by blaze02
You guys are way off. It's an assembly/pipeline issue.

Nope, and certainly not on modern processors. The pipeline you're describing is what most RISC CPUs (MIPS, Alpha, SPARC, etc.) as well as the P5 (Pentium) were like. Starting with the P6 (Pentium Pro, Pentium II, and on up), processors execute instructions out of order, and there is no single clear "pipeline." To be more complete, the processor has several pipelines (a few integer pipelines, a few floating-point pipelines, etc.) that execute concurrently.

In fact, due to the pipelined execution units, out-of-order execution, and, most importantly, register renaming, modern CPUs can essentially unroll loops on the fly.

The only thing that may hurt is, as the posters immediately above me mentioned, the branches. The CPU must predict whether each branch is taken or not. If it predicts wrong, it will do a lot of useless work and waste a lot of time; it's because of this speculative execution that you notice this type of slowdown.

Modern branch prediction, however, is incredibly good (generally on the order of 96-97+% accurate). For a loop like this with 100 iterations, the CPU will certainly mispredict the final branch (when the loop exits) and may mispredict the very first one. Given how branch prediction works, it will get all the other branches correct, effectively unrolling the loop dynamically at run time.

So, in short, those two pieces of code will execute in the same time on a modern CPU. In fact, the first one is probably quicker as it gets rid of the final addition. But we're talking a few clock cycles here.

Quote:
Original post by penwan
In fact, due to the pipelined execution units, out-of-order execution, and, most importantly, register renaming, modern CPUs can essentially unroll loops on the fly.

Yes, but with a lot of restrictions. Compilers often have a hard time figuring out how much it's safe to unroll a loop, because of aliasing and various dependencies.

Anyway, that's beside the point. There certainly *is* a noticeable performance bottleneck in the original version, and it's not the branches. (They might add a handful of cycles because it'll guess wrong in the last and maybe the first iteration)

Floating-point addition has a latency of 4+ cycles on modern PCs. So always adding to the same variable (sum) means you have to wait 4 cycles between each addition: you can't add anything to a variable before you know its value, not even with pipelining and out-of-order CPUs. So you waste time.
The second version should be quite a bit faster, since it stores the temporary results in two independent variables. Hence, you don't need to wait for the previous addition to complete before you start on the following one.
You would probably gain a bit more (but not much more) by using 3 or even 4 such temporary variables.
Ideally, you'd want one addition to retire every cycle (or two, if your CPU can handle that; I know the Athlon 64 can't, and I doubt the P4 can either).

In fact, AMD has a similar example in their optimization guide. The only difference is that they use multiplication instead of addition, and they perform 4 at a time, to match the 4-cycle latency. But the principle is the same: allow the CPU to start an operation every cycle, without having to wait for the previous result to complete.
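Concretely, the four-accumulator version of the original loop would look something like this (my sketch of the idea, not the guide's actual code):

double list[100];
for (int i = 0; i < 100; i++)
    list[i] = i * 0.5; // stand-in data

// Four independent dependency chains: each addition only waits on its
// own accumulator, so a new one can start (nearly) every cycle.
double sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
for (int i = 0; i < 100; i += 4) // 100 divides evenly by 4, so no tail loop
{
    sum0 += list[i];
    sum1 += list[i + 1];
    sum2 += list[i + 2];
    sum3 += list[i + 3];
}
double sum = (sum0 + sum1) + (sum2 + sum3); // combine at the end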

And sheesh, I'm surprised everyone (except possibly blaze02) missed this. [wink]
Screw the branching issues. It doesn't take a complex branch predictor to handle something like this loop. What matters is the data dependencies that prevent it from pipelining efficiently.
