Out-of-Order Execution Performance

Started by
8 comments, last by GameDev.net 18 years ago
I've been toying with out-of-order execution to boost performance. To my surprise, it has really increased performance by a good amount. However, I've noticed a sudden drop, and I'm not sure why.

```cpp
// does amazingly well
for (int ii = 0; ii < MAX_SIZE; ++ii)
{
    sum1 = data[ii] + data_2[ii];
    sum2 = data_3[ii] + data_4[ii];
    ...
    ...
    sum8 = data_15[ii] + data_16[ii];
}

// sudden dramatic loss in performance
for (int ii = 0; ii < MAX_SIZE; ++ii)
{
    sum1 = data[ii] + data_2[ii];
    sum2 = data_3[ii] + data_4[ii];
    ...
    ...
    sum8 = data_15[ii] + data_16[ii];
    sum1_b = data[ii] + data_2[ii];
    sum2_b = data_3[ii] + data_4[ii];
    ...
    ...
    sum4_b = data_6[ii] + data_7[ii];
}
```

I added a few more operations in the loop and it kills my performance on a Pentium M 2.0. I tried the same code on a P4 3.2 and my performance was back up to normal. Does the difference in CPU speed make the difference in this case? Or is it something on the chip itself (maybe more L2 cache?)? Thanks for any help. Sorry if the question is a little too newbie.
The caching of the code can make a big performance difference, which is probably what you are seeing.

Code is cached in a set-associative cache. Say you have 16 KB of cache and it is split up into sets, where each set can hold 4 lines. Each piece of code is mapped (hashed by its address) to a certain set in the cache. So sometimes you will get 5 hot pieces of code mapping to the same set but only 4 available slots. This causes those bits of code to constantly evict each other every time the code is executed, which will cause a drop in performance.

By adding the extra bit of code you shifted the addresses of the frequently executed code just enough that the conflict described above started happening.

Also, if you have a main loop whose code size is greater than the cache, you will see a performance drop.

Hopefully this makes a bit of sense to you. Someone else may be able to explain it better.

Greig
The above is correct; I just wanted to clarify that the problem is almost certainly due to the L1 cache rather than some other cache. I don't know about Intel machines, but AMD CPUs for the last many years have had at least 64 KB each of L1 code and data cache, shared among all processes on the system.
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
Sorry if this is a silly question. Is there a way around this? Perhaps break the routine into two separate calls which get executed consecutively.
Quote:Original post by fathom88
Sorry if this is a silly question. Is there a way around this? Perhaps break the routine into two separate calls which get executed consecutively.

You already have a solution...reduce the size of your loop.

CM
Quote:Original post by Conner McCloud
Quote:Original post by fathom88
Sorry if this is a silly question. Is there a way around this? Perhaps break the routine into two separate calls which get executed consecutively.

You already have a solution...reduce the size of your loop.

CM


Hehe yeah, like all performance optimizations, this trick is only good in moderation. Sooner or later, you start causing other bottlenecks. So the way "around" this is to just find a loop size that's suitable for all CPUs.

The Pentium M has a 32 KB L1 instruction cache as far as I know. The P4, as far as I can remember, instead has a trace cache of about 12k µops, which I'd expect translates to fewer instructions than the Pentium M can hold.
And anyway, I find it unlikely that the L1 cache should pose a problem after only, what, 12 operations per iteration? At least, I really doubt that's the full explanation.

An alternative suggestion is that it might just be a lack of registers. On 32-bit x86 you've got 8 general-purpose registers (and not all of them may be free). If you try to keep more than that many independent sums live per iteration, some values have to be spilled to the stack instead, which is quite a bit slower.

You could always just take a look at the assembly code it produces, and see if any registers are spilled.
Loop unrolling did the trick. I divided my MAX_SIZE by 4 and stepped my iteration by 4. I created enough variables to do four sets of calcs per iteration. At first, I tried to do the calc on the 1st half and then the 2nd half. Why did this not yield a performance boost? When do things enter and exit the L1 cache?

```cpp
int half_max = MAX_SIZE / 2;
for (int ii = 0; ii < half_max; ++ii)
{
    // do 1st half
}

for (int ii = half_max; ii < MAX_SIZE; ++ii)
{
    // do 2nd half
}
```

Thanks for the replies.

Quote:Original post by fathom88
Loop unrolling did the trick. I divided my MAX_SIZE by 4 and stepped my iteration by 4. I created enough variables to do four sets of calcs per iteration. At first, I tried to do the calc on the 1st half and then the 2nd half. Why did this not yield a performance boost? When do things enter and exit the L1 cache?

```cpp
int half_max = MAX_SIZE / 2;
for (int ii = 0; ii < half_max; ++ii)
{
    // do 1st half
}

for (int ii = half_max; ii < MAX_SIZE; ++ii)
{
    // do 2nd half
}
```

Thanks for the replies.

For all practical purposes, that is no different than a single loop. You're performing the same number of iterations, and the same operations on each iteration.

CM
Quote:
For all practical purposes, that is no different than a single loop. You're performing the same number of iterations, and the same operations on each iteration.

Yeah, all you're doing is helping your compiler out by doing a manual unrolling. With an optimizing compiler like VectorC or ICC you can use some pragmas and hints to achieve this, and you should see some nice, fast unrolled and vectorized code.
Get a profiler (one I recommend is ProfileSharp from SoftProdigy, www.softprodigy.com) and find out which lines of code are running slow.

