# Quick Silly Question On Performance

This topic is 4601 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

I was reading AMD's Optimization guide. I cam across some interesting facts. The guide does not always give good explanations. SLOW ------------ double x[MAX], y[MAX], z[MAX]; unsigned int k; for(k = 1; k < MAX; k++) { x[k] = x[k-1] + y[k]; } FAST -------- double x[MAX], y[MAX], z[MAX]; unsigned int k; double t; t = x[0]; for(k = 1; k < MAX; k++) { t =t + y[k]; x[k] = t; } The manual calls it a store-and-load dependency. I always wrote my code the 2nd way because it was cleaner than writing a long equation and then setting it to a variable. However, I always assumed it was slower because of the extra sets to my extra temp variables. They also mention parallelism. I've been using it for some time. However, I was a little confused by their example. SLOW ------------- double a[MAX], sum; int i; sum = 0.0; for(i=0; i < MAX; ++i) { sum +=a; } FAST ------------- double a[MAX], sum1, sum2, sum3, sum4, sum; int i; sum = 0.0; for(i=0; i < MAX; i += 4) { sum1 +=a; sum2 +=a[i + 1]; sum3 +=a[i + 2]; sum4 +=a[i + 3]; } sum = sum1 + sum2 + sum3 + sum4; In the 2nd example, how can sum2 begin to execuate without 1st finding the result of i + 1? Thanks in advance for clearing things up.

##### Share on other sites
It seems to me that the 2nd example would only work if MAX is divisable by 4.

##### Share on other sites
Quote:
 Original post by fathom88SLOW------------double x[MAX], y[MAX], z[MAX];unsigned int k;for(k = 1; k < MAX; k++){x[k] = x[k-1] + y[k];}FAST--------double x[MAX], y[MAX], z[MAX];unsigned int k;double t;t = x[0];for(k = 1; k < MAX; k++){t =t + y[k];x[k] = t;}The manual calls it a store-and-load dependency.

Basically, there's nothing wrong with some temp variables. That's just a register. Loads and stores are memory accesses though, which can take forever. Keep in mind that arrays accesses will typically mean a memory access.
In the first version, you need to load the k-1'th array element in every iteration. But that was the one you stored a new value to in the previous iteration. That means the previous store has to be entirely finished before you can even start loading the value out again. That *can* take forever.
In the second version, you dont access the k-1'th element, instead you just keep a temp variable around which holds the value across iterations. So you dont need to load the address you just stored to a moment ago. And that means you dont have to wait for the store operation to actually finish. You just have to start it, and then you can begin on the next iteration.

Quote:
 They also mention parallelism. I've been using it for some time. However, I was a little confused by their example.SLOW-------------double a[MAX], sum;int i;sum = 0.0;for(i=0; i < MAX; ++i){sum +=a;}FAST-------------double a[MAX], sum1, sum2, sum3, sum4, sum;int i;sum = 0.0;for(i=0; i < MAX; i += 4){sum1 +=a;sum2 +=a[i + 1];sum3 +=a[i + 2];sum4 +=a[i + 3];}sum = sum1 + sum2 + sum3 + sum4;

It can't. But calculating i+1 isn't the slow part. (If I recall correctly, integer addition only takes a singe cycle on modern CPU's)
Before, you had to compute a new index before you could load out the value from hte arra. (And again, loading from memory might take ages) Then you had to wait for that to finish before you could add it to the sum, so you could start on the next iteration.
But why do you have to wait for sum to be updated in every iteration? Why not compute the index, load from the array, add it to one variable, *while* computing the next index, loading from somewhere else in the array, and adding it to a second variable? The two operations are independant, so the second doesn't have to wait for the first one to complete. They change different variables, so they don't have to wait for each others.

Modern CPU's are heavily pipelined, which means that they can start new instructions before the previous ones are finished executing. But that obviously only works if the instruction doesn't *depend* on the previous one. Moreover, they actually execute up to 3 instructions at the same time, every cycle. That's why this optimization can give you a big performance gain. The less dependencies there are between your instructions, the easier they are to execute in parallel.

Oh yeah, and true, the last version only works if the number of iterations is divisible by 4. That is easy to work around though. You can just add a second "normal" loop afterwards to take care of the last 0-3 iterations. AMD just left that out because it wasn't important to the optimization.

##### Share on other sites
Only one note on this concrete example. Compiler might do some improvements based on using SSE parallel instructions, ¿which one of the two versions (SLOW or FAST) could get a optimization of this kind easier?. FAST version will run faster due parallel processing, but i think the SLOW version can be easily optimized employing parallel instructions (like ADDPS) which means a huge speed improvement.

##### Share on other sites
Quote:
 Original post by Anonymous PosterOnly one note on this concrete example. Compiler might do some improvements based on using SSE parallel instructions, ¿which one of the two versions (SLOW or FAST) could get a optimization of this kind easier?. FAST version will run faster due parallel processing, but i think the SLOW version can be easily optimized employing parallel instructions (like ADDPS) which means a huge speed improvement.

If you are talking about the second example, then if a compiler tries to use SSE it would most likely also optimize the second. You have four adds right after each other, not dependent on each other and even continous in memory. Really it's the best input ADDPS could get. The only problem is that the inputs is double precision and ADDPS takes single precision floating point values, but if it can use SSE2 it could just use ADDPD (D = double, S = single).

##### Share on other sites
Quote:
 Original post by Anonymous PosterOnly one note on this concrete example. Compiler might do some improvements based on using SSE parallel instructions, ¿which one of the two versions (SLOW or FAST) could get a optimization of this kind easier?. FAST version will run faster due parallel processing, but i think the SLOW version can be easily optimized employing parallel instructions (like ADDPS) which means a huge speed improvement.

The whole point of these examples is to show that what matters far more than counting cycles increasing parallellism (even on single-core processors). If you've got a good compiler (although I've heard many people say that there aren't many that are good when it comes to vectorizing) then I don't see any reason why the FAST examples couldn't be optimized to make use of SIMD, while at the same time still receiving the advantages of reducing memory dependencies between instructions.

##### Share on other sites
True, but I've never seen a compiler *actually* generate SSE code (or at least, not the packed instructions (*PD or *PS), even with obvious input like this.

Oh, and at a closer look, it's not very obvious for SSE at all.
Each iteration depends on the previous one. That's not very SSE-friendly.
A SSE version might still be able to be a bit faster, but the dependence between iterations is a major problem there.

1. 1
2. 2
3. 3
Rutin
15
4. 4
5. 5

• 10
• 9
• 9
• 11
• 11
• ### Forum Statistics

• Total Topics
633685
• Total Posts
3013322
×