Odd Performance Question

Quote:Original post by Spoonbender
And sheesh, I'm surprised everyone (except possibly blaze02) missed this. [wink]
Screw the branching issues. It doesn't take a complex branch predictor to handle something like this loop. What matters is the data dependencies that prevent it from pipelining efficiently.


Most of my experience has been with 486 assembly; I haven't read up on the Pentium+ pipelines. Yeah, for a simple loop the processor will "guess" the correct next operation like 99% of the time. In fact, I wouldn't worry about anything this low-level at all, or you might as well be writing assembly. You are talking 1-2 microseconds difference (AT BEST!) with this optimization. The reason games slow down is an entirely different issue. Spend your time learning which graphics render states take the longest to swap in and out, and group accordingly. If you plan on handling a ton of objects, look into the possibility of a hash_map instead of a vector or list. Those are the kinds of speed-ups you will actually be able to see and use.
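For instance, something along these lines (untested; using std::unordered_map, the standardized descendant of hash_map, and made-up names like GameObject/byId), contrasting a linear scan against a hash lookup:

#include <cstdio>
#include <unordered_map>
#include <vector>

struct GameObject { int id; /* ... */ };

int main() {
    std::vector<GameObject> objects = {{1}, {2}, {3}};
    int wanted = 2;

    // Linear search: O(n) per lookup -- painful with "a ton of objects".
    for (const GameObject& o : objects)
        if (o.id == wanted) { std::printf("found %d by scan\n", o.id); break; }

    // Hash lookup: amortized O(1) per lookup, however many objects you have.
    std::unordered_map<int, GameObject> byId;
    for (const GameObject& o : objects) byId[o.id] = o;
    auto it = byId.find(wanted);
    if (it != byId.end()) std::printf("found %d by hash\n", it->second.id);
}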
Quote:Original post by CGameProgrammer
Quote:Original post by blaze02
You guys are way off. [irrelevance follows]

Um, no. For one thing, you sort of combined pairing and pipelining -- pipelining allows a multi-clock instruction, like an integer multiplication (3 clocks), to run its 2nd and 3rd clocks in the background while new instructions are fetched.


HAHAHA, please somebody tell me this is wrong. The multiply opcode took 30+ clock cycles on the 486. No way it's down to "3 clocks". And don't ever quote me as saying [irrelevance follows].
Bah, people are always making this stuff more complex than it has to be. "Loop unrolling" is all you really need to think about. All that minuscule-level processor gibberish merely becomes more efficient because of the unrolling.
-Mike
Quote:Original post by blaze02
Quote:Original post by Spoonbender
And sheesh, I'm surprised everyone (except possibly blaze02) missed this. [wink]
Screw the branching issues. It doesn't take a complex branch predictor to handle something like this loop. What matters is the data dependencies that prevent it from pipelining efficiently.


Most of my experience has been with 486 assembly; I haven't read up on the Pentium+ pipelines. Yeah, for a simple loop the processor will "guess" the correct next operation like 99% of the time. In fact, I wouldn't worry about anything this low-level at all, or you might as well be writing assembly. You are talking 1-2 microseconds difference (AT BEST!) with this optimization. The reason games slow down is an entirely different issue. Spend your time learning which graphics render states take the longest to swap in and out, and group accordingly. If you plan on handling a ton of objects, look into the possibility of a hash_map instead of a vector or list. Those are the kinds of speed-ups you will actually be able to see and use.


Well, this one might be noticeable if you run it in a really tight inner loop. (If we're talking 100 iterations you won't be able to measure a difference; if this is where 90% of your app's execution time is spent, I can assure you the second version will be a hell of a lot faster than the first.) [wink]
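We never saw the OP's exact two versions, so purely as a made-up illustration of the flavor of difference I mean -- a loop-carried dependency versus independent iterations:

// Version A: each iteration needs the previous result before it can start,
// so the adds execute essentially one at a time (a loop-carried dependency).
float serial_sum(const float* a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i];              // can't start until the previous add finished
    return s;
}

// Version B: iterations are independent, so the CPU can have several
// adds in flight at once and the loop pipelines nicely.
void independent_add(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];   // no iteration depends on another
}

int main() {
    float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, out[4];
    independent_add(a, b, out, 4);
    return (int)(serial_sum(a, 4) + out[0]);  // use the results so nothing is elided
}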
Quote:Original post by blaze02
Quote:Original post by CGameProgrammer
Quote:Original post by blaze02
You guys are way off. [irrelevance follows]

Um, no. For one thing, you sort of combined pairing and pipelining -- pipelining allows a multi-clock instruction, like an integer multiplication (3 clocks), to run its 2nd and 3rd clocks in the background while new instructions are fetched.


HAHAHA, please somebody tell me this is wrong. The multiply opcode took 30+ clock cycles on the 486. No way it's down to "3 clocks".

You'll do well to read the Intel PDF titled "Intel Architecture Optimization", specifically Appendix C.

Quote:Original post by blaze02
HAHAHA, please somebody tell me this is wrong. The multiply opcode took 30+ clock cycles on the 486. No way it's down to "3 clocks". And don't ever quote me as saying [irrelevance follows].

I remember it being exactly 3 clocks for the Pentium II and above. An integer division is around 20-30 clocks IIRC. Yes, it was really slow on a 486, but modern processors are not overclocked 486s.
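And note that those few clocks are the *latency*; whether you actually feel them depends on dependencies. A rough, untested sketch of the difference (cycle counts assumed from the manual's latency tables):

// Dependent chain: multiply i+1 can't start until multiply i produces its
// result, so n multiplies cost roughly n * latency cycles.
long long chained(long long x, int n) {
    for (int i = 0; i < n; ++i)
        x = x * 3 + 1;          // the next iteration needs this x
    return x;
}

// Independent chains: the pipeline can start a new multiply before the
// previous one finishes, so the cost approaches n * throughput instead.
long long overlapped(long long x, int n) {
    long long a = x, b = x + 1, c = x + 2;
    for (int i = 0; i < n; i += 3) {
        a = a * 3 + 1;          // three separate chains that the CPU
        b = b * 3 + 1;          // can keep in flight simultaneously
        c = c * 3 + 1;
    }
    return a + b + c;
}

int main() {
    // Same number of multiplies; the second should take far fewer cycles.
    return (int)(chained(1, 999) ^ overlapped(1, 999));
}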
~CGameProgrammer( );
Oh man. The amount of misinformation here is worrisome. Thanks to Spoonbender for finally getting it right :)

Quote: speed benefit is due to the reduction of conditional jumps. [..] The reason this is slow is that the program does not know which code will be executed until it has a chance to evaluate the condition

Not true. A correctly predicted conditional jump can be considered to have *no* overhead at all (thanks to the Branch Target Buffer). It's a mispredicted jump that is the killer.
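If you want to see the misprediction penalty for yourself, here is a quick, untested harness (sizes and names are arbitrary, and an aggressive optimizer may turn the branch into a conditional move and flatten the gap):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Sum the elements >= 128 so the loop contains a data-dependent branch.
long long sum_big(const std::vector<int>& v) {
    long long s = 0;
    for (int x : v)
        if (x >= 128)   // predictable on sorted data, ~50/50 on random data
            s += x;
    return s;
}

int main() {
    std::vector<int> v(1 << 24);
    std::mt19937 rng(42);
    for (int& x : v) x = rng() & 255;   // random values 0..255

    auto time_it = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = sum_big(v);
        auto t1 = std::chrono::steady_clock::now();
        std::chrono::duration<double, std::milli> ms = t1 - t0;
        std::printf("%s sum=%lld, %.1f ms\n", label, s, ms.count());
    };

    time_it("random:");                 // the branch mispredicts constantly
    std::sort(v.begin(), v.end());
    time_it("sorted:");                 // same work, far fewer mispredicts
}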

Quote:The only thing that may hurt is, as the posters immediately above me mentioned, the branches.

Noo! We are talking about a loop, which is well-predicted.

Quote:please somebody tell me this is wrong. The multiply opcode took 30+ clock cycles on the 486. No way it's down to "3 clocks".

It is indeed. The 386 still used an external FPU; the 486DX brought it on-chip, the Pentium core pipelined it, and it has been further improved since (several execution units added).

Quote:Bah, people are always making this stuff more complex than it has to be. "loop unrolling" is all you really need to think about. All that minuscule-level processor giberish merely becomes more efficient because of the unrolling.

Nope, loop unrolling was relevant several years ago. The thing to be thinking of nowadays is "software pipelining"; Google turns up a nice intro article.
In short it's about minimizing dependencies so you can make use of 'all' the execution units in your N-way superscalar CPU and hide latency. The above example is a very simple case of this.

Oh, and this "minuscule-level processor gibberish" can result in a 2.5x speed improvement (quoted in the above article). I'll take it!
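To make the idea concrete, a toy example of my own (not from the article; untested): unrolling alone keeps one serial chain, while splitting the work across several accumulators lets the execution units overlap.

// Unrolled but still serial: four adds per iteration, yet every add
// feeds the next through 's', so the dependency chain is unchanged.
float unrolled(const float* a, int n) {     // n assumed to be a multiple of 4
    float s = 0.0f;
    for (int i = 0; i < n; i += 4)
        s = s + a[i] + a[i+1] + a[i+2] + a[i+3];
    return s;
}

// Software-pipelined flavor: four independent chains, merged at the end.
// Each accumulator depends only on itself, so the adds can overlap.
float pipelined(const float* a, int n) {    // n assumed to be a multiple of 4
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i+1];
        s2 += a[i+2];
        s3 += a[i+3];
    }
    return (s0 + s1) + (s2 + s3);
}

int main() {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    return (int)(unrolled(a, 8) + pipelined(a, 8));  // use both results
}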
Quote:Original post by Anon Mike
Bah, people are always making this stuff more complex than it has to be. "Loop unrolling" is all you really need to think about. All that minuscule-level processor gibberish merely becomes more efficient because of the unrolling.


Er, no. Loop unrolling is exactly what doesn't make much difference.
"Bah, people are always assuming CPU performance is simpler than it is." [wink]
You can unroll the loop as much as you like, and you won't see more than a few percent improvement in performance.
What the OP actually asked about might give you 50-75% extra performance, at a rough estimate.

Quote:
HAHAHA, please somebody tell me this is wrong. The multiply opcode took 30+ clock cycles on the 486. No way it's down to "3 clocks".

Sounds about right. The Athlon 64 has a 4-cycle latency for FP mul. Keep in mind, this is just the latency before the result is available to other instructions; the entire instruction, from fetch to retire, takes quite a bit longer.
And FP math has improved quite a bit since the in-order days of the 486 (and the 386's external FPU).
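A quick way to see latency versus throughput for yourself -- an untested sketch, assuming roughly those Athlon 64 numbers; exact ratios will vary by CPU and compiler:

#include <chrono>
#include <cstdio>

int main() {
    const int N = 100000000;

    // One dependent chain: each multiply waits for the previous result,
    // so the loop runs at roughly one multiply per *latency* (~4 cycles).
    auto t0 = std::chrono::steady_clock::now();
    double x = 1.0000001;
    for (int i = 0; i < N; ++i)
        x *= 1.0000001;
    auto t1 = std::chrono::steady_clock::now();

    // Four independent chains doing the same total work: the multiplies
    // overlap in the pipeline, approaching one per *throughput* slot.
    double a = 1.0, b = 1.1, c = 1.2, d = 1.3;
    for (int i = 0; i < N; i += 4) {
        a *= 1.0000001; b *= 1.0000001;
        c *= 1.0000001; d *= 1.0000001;
    }
    auto t2 = std::chrono::steady_clock::now();

    // Print the results so the compiler can't discard the loops.
    std::chrono::duration<double, std::milli> first = t1 - t0, second = t2 - t1;
    std::printf("dependent:   %.1f ms (x=%g)\n", first.count(), x);
    std::printf("independent: %.1f ms (%g)\n", second.count(), a * b * c * d);
}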

Quote:
Oh, and this "minuscule-level processor gibberish" can result in a 2.5x speed improvement (quoted in the above article). I'll take it!

You can get much more than 2.5x in some circumstances.
I did a project at uni last semester where I ended up with around a 10x speed improvement, purely from "minuscule gibberish" like this.
All it takes is a good understanding of CPU architecture and performance.
Quote:Original post by blaze02
And don't ever quote me as saying [irrelevance follows].


Don't take this the wrong way. What are you going to do about it?
Thanks for all the replies. It got me thinking about other things that I can do to increase my performance.

P.S. My performance bottleneck is not inside my render routines. I simply use OpenGL to draw my data set; calculating the data set is the hard part. It would normally be done on an FPGA, but I have to use a laptop to do the calcs. I don't think what I'm doing would have been possible a few years ago, but modern CPUs have advanced to the point where it can be done; just not easily.


Thanks again.

