Quick Silly Question On Performance

Started by
6 comments, last by Spoonbender 17 years, 11 months ago
I was reading AMD's optimization guide and came across some interesting facts, but the guide does not always give good explanations.

SLOW
------------
double x[MAX], y[MAX], z[MAX];
unsigned int k;

for(k = 1; k < MAX; k++)
{
    x[k] = x[k-1] + y[k];
}

FAST
--------
double x[MAX], y[MAX], z[MAX];
unsigned int k;
double t;

t = x[0];
for(k = 1; k < MAX; k++)
{
    t = t + y[k];
    x[k] = t;
}

The manual calls it a store-and-load dependency. I always wrote my code the second way because it was cleaner than writing a long expression and then assigning it to a variable. However, I always assumed it was slower because of the extra assignments to my temp variables.

They also mention parallelism. I've been using it for some time. However, I was a little confused by their example.

SLOW
-------------
double a[MAX], sum;
int i;

sum = 0.0;

for(i=0; i < MAX; ++i)
{
    sum += a[i];
}

FAST
-------------
double a[MAX], sum1, sum2, sum3, sum4, sum;
int i;

sum1 = sum2 = sum3 = sum4 = 0.0;

for(i=0; i < MAX; i += 4)
{
    sum1 += a[i];
    sum2 += a[i+1];
    sum3 += a[i+2];
    sum4 += a[i+3];
}

sum = sum1 + sum2 + sum3 + sum4;

In the second example, how can sum2 begin to execute without first finding the result at i + 1? Thanks in advance for clearing things up.
It seems to me that the 2nd example would only work if MAX is divisible by 4.
Simplicity is the ultimate sophistication. – Leonardo da Vinci
Quote:Original post by fathom88
SLOW
------------

double x[MAX], y[MAX], z[MAX];
unsigned int k;

for(k = 1; k < MAX; k++)
{
    x[k] = x[k-1] + y[k];
}

FAST
--------
double x[MAX], y[MAX], z[MAX];
unsigned int k;
double t;

t = x[0];
for(k = 1; k < MAX; k++)
{
    t = t + y[k];
    x[k] = t;
}

The manual calls it a store-and-load dependency.

Basically, there's nothing wrong with some temp variables; a temp is typically just a register. Loads and stores are memory accesses, though, which can take forever. Keep in mind that array accesses will typically mean a memory access.
In the first version, you need to load the (k-1)'th array element in every iteration. But that is the one you stored a new value to in the previous iteration, which means the previous store has to be entirely finished before you can even start loading the value back out. That *can* take forever.
In the second version, you don't access the (k-1)'th element; instead you keep a temp variable around which holds the value across iterations. So you don't need to load from the address you just stored to a moment ago, which means you don't have to wait for the store operation to actually finish. You just have to start it, and then you can begin on the next iteration.

Quote:
They also mention parallelism. I've been using it for some time. However, I was a little confused by their example.

SLOW
-------------

double a[MAX], sum;
int i;

sum = 0.0;

for(i=0; i < MAX; ++i)
{
    sum += a[i];
}

FAST
-------------
double a[MAX], sum1, sum2, sum3, sum4, sum;
int i;

sum1 = sum2 = sum3 = sum4 = 0.0;

for(i=0; i < MAX; i += 4)
{
    sum1 += a[i];
    sum2 += a[i+1];
    sum3 += a[i+2];
    sum4 += a[i+3];
}

sum = sum1 + sum2 + sum3 + sum4;

It can't. But calculating i+1 isn't the slow part. (If I recall correctly, integer addition only takes a single cycle on modern CPUs.)
Before, you had to compute a new index before you could load the value from the array. (And again, loading from memory might take ages.) Then you had to wait for that to finish before you could add it to the sum, so you could start on the next iteration.
But why do you have to wait for sum to be updated in every iteration? Why not compute the index, load from the array, add it to one variable, *while* computing the next index, loading from somewhere else in the array, and adding it to a second variable? The two operations are independent, so the second doesn't have to wait for the first one to complete. They change different variables, so they don't have to wait for each other.

Modern CPUs are heavily pipelined, which means they can start new instructions before the previous ones have finished executing. But that obviously only works if an instruction doesn't *depend* on the previous one. Moreover, they can actually execute up to three instructions at the same time, every cycle. That's why this optimization can give you a big performance gain. The fewer dependencies there are between your instructions, the easier they are to execute in parallel.

Oh yeah, and true, the last version only works if the number of iterations is divisible by 4. That is easy to work around, though. You can just add a second "normal" loop afterwards to take care of the last 0-3 iterations. AMD just left that out because it wasn't important to the optimization.
Thanks for the reply!!!
One note on this concrete example: the compiler might make some improvements using SSE parallel instructions. Which of the two versions (SLOW or FAST) could get this kind of optimization more easily? The FAST version will run faster due to parallel processing, but I think the SLOW version could easily be optimized with parallel instructions (like ADDPS), which would mean a huge speed improvement.
Quote:Original post by Anonymous Poster
One note on this concrete example: the compiler might make some improvements using SSE parallel instructions. Which of the two versions (SLOW or FAST) could get this kind of optimization more easily? The FAST version will run faster due to parallel processing, but I think the SLOW version could easily be optimized with parallel instructions (like ADDPS), which would mean a huge speed improvement.


If you are talking about the second example, then a compiler that tries to use SSE would most likely also optimize the FAST version. You have four adds right after each other, not dependent on each other, and even contiguous in memory. Really, it's the best input ADDPS could get. The only problem is that the input is double precision and ADDPS takes single-precision floating-point values, but if it can use SSE2 it could just use ADDPD (D = double, S = single).
Quote:Original post by Anonymous Poster
One note on this concrete example: the compiler might make some improvements using SSE parallel instructions. Which of the two versions (SLOW or FAST) could get this kind of optimization more easily? The FAST version will run faster due to parallel processing, but I think the SLOW version could easily be optimized with parallel instructions (like ADDPS), which would mean a huge speed improvement.


The whole point of these examples is to show that what matters far more than counting cycles is increasing parallelism (even on single-core processors). If you've got a good compiler (although I've heard many people say that there aren't many that are good when it comes to vectorizing), then I don't see any reason why the FAST examples couldn't be optimized to make use of SIMD, while at the same time still getting the advantage of fewer memory dependencies between instructions.
"Voilà! In view, a humble vaudevillian veteran, cast vicariously as both victim and villain by the vicissitudes of Fate. This visage, no mere veneer of vanity, is a vestige of the vox populi, now vacant, vanished. However, this valorous visitation of a bygone vexation stands vivified, and has vowed to vanquish these venal and virulent vermin vanguarding vice and vouchsafing the violently vicious and voracious violation of volition. The only verdict is vengeance; a vendetta held as a votive, not in vain, for the value and veracity of such shall one day vindicate the vigilant and the virtuous. Verily, this vichyssoise of verbiage veers most verbose, so let me simply add that it's my very good honor to meet you and you may call me V.".....V
True, but I've never seen a compiler *actually* generate SSE code (or at least not the packed instructions, *PD or *PS), even with obvious input like this.

Oh, and on closer look, it's not an obvious case for SSE at all. Each iteration depends on the previous one, and that's not very SSE-friendly. An SSE version might still manage to be a bit faster, but the dependence between iterations is a major problem there.

