Out-of-Order Execution Performance

Started by
8 comments, last by GameDev.net 18 years ago
I've been toying with out-of-order execution to boost performance. To my surprise, it has really increased performance by a good amount. However, I've noticed a sudden drop, and I'm not sure why.

```cpp
// does amazingly well
for (int ii = 0; ii < MAX_SIZE; ++ii)
{
    sum1 = data[ii] + data_2[ii];
    sum2 = data_3[ii] + data_4[ii];
    ...
    ...
    sum8 = data_15[ii] + data_16[ii];
}

// sudden dramatic loss in performance
for (int ii = 0; ii < MAX_SIZE; ++ii)
{
    sum1 = data[ii] + data_2[ii];
    sum2 = data_3[ii] + data_4[ii];
    ...
    ...
    sum8 = data_15[ii] + data_16[ii];
    sum1_b = data[ii] + data_2[ii];
    sum2_b = data_3[ii] + data_4[ii];
    ...
    ...
    sum4_b = data_6[ii] + data_7[ii];
}
```

I added a few more operations in the loop and it kills my performance on a Pentium M 2.0. I tried the same code on a P4 3.2 and my performance was back up to normal. Does the difference in CPU speed make the difference in this case? Or is it something on the chip itself (maybe more L2 cache?)? Thanks for any help. Sorry if the question is a little too newbie.
The caching of the code can make a big performance difference, which is probably what you are seeing.

Code is cached in a set-associative cache. Say you have 16 KB of cache and it is split up into sets, where each set can hold 4 lines. Each piece of code is mapped (hashed by its address) to a certain set in the cache. So sometimes you will get 5 hot pieces of code mapping to the same set but only 4 available slots. This causes those bits of code to constantly evict each other every time the code is executed, which will cause a drop in performance.

By adding the extra bit of code you shifted the addresses of the frequently executed code just enough that the conflict described above started happening.

Also, if you have a main loop whose code size is greater than the cache, you will see a performance drop.

Hopefully this makes a bit of sense to you. Someone else may be able to explain it better.

Greig
The above is correct; I just wanted to clarify that the problem is almost certainly due to the L1 cache rather than some other cache. I don't know about Intel machines, but AMD CPUs for the last many years have had at least 64 KB each of L1 code and data cache, shared among all processes on the system.
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
Sorry if this is a silly question. Is there a way around this? Perhaps break the routine into two separate calls which get executed consecutively.
Quote:Original post by fathom88
Sorry if this is a silly question. Is there a way around this? Perhaps break the routine into two separate calls which get executed consecutively.

You already have a solution...reduce the size of your loop.

CM
Quote:Original post by Conner McCloud
Quote:Original post by fathom88
Sorry if this is a silly question. Is there a way around this? Perhaps break the routine into two separate calls which get executed consecutively.

You already have a solution...reduce the size of your loop.

CM


Hehe yeah, like all performance optimizations, this trick is only good in moderation. Sooner or later, you start causing other bottlenecks. So the way "around" this is to just find a loop size that's suitable for all CPUs.

The Pentium M has a 32 KB L1 instruction cache as far as I know. The P4, as far as I can remember, instead has a trace cache of about 12k µops, which I'd expect translates to fewer instructions than the Pentium M can hold.
And anyway, I find it unlikely that the L1 cache should pose a problem after only, what, 12 operations per iteration? At least, I really doubt that's the full explanation.

An alternative suggestion is that it might just be a lack of registers. On 32-bit x86 you've got 8 general-purpose registers (and not all of them may be free). If you try to keep more than that many independent sums live per iteration, some values have to be spilled to the stack instead, which is quite a bit slower.

You could always just take a look at the assembly code it produces, and see if any registers are spilled.
Loop unrolling did the trick. I divided my MAX_SIZE by 4 and stepped my iteration by 4. I created enough variables to do four sets of calcs per iteration. At first, I tried to do the calc on the 1st half and then the 2nd half. Why did this not yield a performance boost? When do things enter and exit the L1 cache?

```cpp
int half_max = MAX_SIZE / 2;
for (int ii = 0; ii < half_max; ++ii)
{
    // do 1st half
}

for (int ii = half_max; ii < MAX_SIZE; ++ii)
{
    // do 2nd half
}
```

Thanks for the replies.

Quote:Original post by fathom88
Loop unrolling did the trick. I divided my MAX_SIZE by 4 and stepped my iteration by 4. I created enough variables to do four sets of calcs per iteration. At first, I tried to do the calc on the 1st half and then the 2nd half. Why did this not yield a performance boost? When do things enter and exit the L1 cache?

```cpp
int half_max = MAX_SIZE / 2;
for (int ii = 0; ii < half_max; ++ii)
{
    // do 1st half
}

for (int ii = half_max; ii < MAX_SIZE; ++ii)
{
    // do 2nd half
}
```

Thanks for the replies.

For all practical purposes, that is no different than a single loop. You're performing the same number of iterations, and the same operations on each iteration.

CM
Quote:
For all practical purposes, that is no different than a single loop. You're performing the same number of iterations, and the same operations on each iteration.

Yeah, all you're doing is helping your compiler out by doing a manual unrolling. With an optimizing compiler like VectorC or ICC you can use some pragmas and hints to achieve this, and you should see some nice, fast unrolled and vectorized code.
Get a profiler (one I recommend is ProfileSharp from SoftProdigy, www.softprodigy.com) and find out which lines of code are running slow.

