Today's games target a diverse range of hardware, spanning dozens of hardware designs and architectures, a variety of standards and interfaces, and a veritable minefield of concerns and design considerations for each. One particularly devilish challenge is balancing code for execution on both out-of-order and in-order CPU execution models. This GDC lecture offered a plethora of hints and advice for working on in-order architectures.
Vectorization and compiler intrinsics can fail to improve on the speed of the optimizing compiler's output. It is therefore important to profile carefully to ensure that tricky vectorization or intrinsics changes are actually beneficial.
Writing code for the PC, on which out-of-order execution models are the de facto standard, can introduce severe difficulties with porting to in-order CPUs on current consoles. It is usually advisable to tune for performance on the in-order platform first and then backport to the PC.
Several areas can cause serious performance problems on in-order CPUs. These are, briefly: load/hit/store sequences (where a value is written to memory and then read back from the same address before the store has completed); L2 cache misses; use of non-pipelined CPU instructions; and branch misprediction. Analysis of code at the generated assembly level is often necessary to catch and correct such issues.
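As a rough illustration of the load/hit/store pattern, the sketch below accumulates a total through a pointer and then through a local variable. The function names are hypothetical, and whether the first version really stalls depends on the compiler and the target CPU, so this is something to confirm in the generated assembly rather than a guaranteed result.

```cpp
#include <cstddef>

// Hypothetical example: accumulating through a pointer. Because `total`
// might point into `values`, the compiler may have to store *total and
// reload it on every iteration, which is the store-then-load pattern
// that stalls in-order CPUs.
void sum_through_memory(const int* values, std::size_t count, int* total) {
    *total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        *total += values[i];
    }
}

// Same result, but the running total lives in a local variable that the
// compiler can keep in a register; memory is written once at the end.
void sum_in_register(const int* values, std::size_t count, int* total) {
    int local = 0;
    for (std::size_t i = 0; i < count; ++i) {
        local += values[i];
    }
    *total = local;
}
```

Both functions produce the same total; only the number of trips through memory differs.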
Cache-friendly memory access is important. Design code to use algorithms that have good cache locality.
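A small sketch of what cache locality means in practice, assuming a grid stored row by row in one contiguous array (the function names and layout are illustrative, not from the lecture). Both functions compute the same sum, but the first touches consecutive addresses while the second jumps by a full row on every step.

```cpp
#include <cstddef>
#include <vector>

// A width x height grid stored row by row in one contiguous array.
// Element (x, y) lives at index y * width + x.

// Walks memory in address order, so each fetched cache line is fully used.
long long sum_row_major(const std::vector<int>& grid,
                        std::size_t width, std::size_t height) {
    long long total = 0;
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            total += grid[y * width + x];
        }
    }
    return total;
}

// Walks down each column, striding by `width` elements per access.
// On a large grid most accesses land on a different cache line.
long long sum_column_major(const std::vector<int>& grid,
                           std::size_t width, std::size_t height) {
    long long total = 0;
    for (std::size_t x = 0; x < width; ++x) {
        for (std::size_t y = 0; y < height; ++y) {
            total += grid[y * width + x];
        }
    }
    return total;
}
```

How large the difference is depends on the grid size relative to the cache, which is one more reason to measure rather than assume.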
Separate "hot" data, which is accessed often, from "cold" data, which is used less often; this helps ensure that hot data is in the cache as much as possible.
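One possible way to split hot and cold data is sketched below using a hypothetical particle system. The type and field names are assumptions made for illustration; the point is only that the per-frame loop touches a compact array and never pulls the rarely used fields into the cache.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hot data: read and written every frame, kept small and contiguous.
struct ParticleHot {
    float x, y, z;
    float vx, vy, vz;
};

// Cold data: consulted rarely (debugging, tools, spawn logic).
struct ParticleCold {
    std::string debug_name;
    int spawn_frame;
};

// Parallel arrays: hot[i] and cold[i] describe the same particle.
struct ParticleSystem {
    std::vector<ParticleHot> hot;
    std::vector<ParticleCold> cold;
};

// The per-frame update walks only the hot array, so no cache lines are
// spent on names or spawn frames.
void integrate(ParticleSystem& system, float dt) {
    for (ParticleHot& p : system.hot) {
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
    }
}
```

The cost of this layout is that code needing both halves must index two arrays, so it suits data whose access patterns really do differ.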
Minimize the use of branches in inner loops and other performance-critical regions. Understanding how the compiler and CPU implement branching is important here.
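The sketch below shows two common ways of expressing a decision without an explicit branch in the source. The function names are hypothetical. Note that a compiler may already turn the branching version into branch-free code, or the reverse, so the generated assembly is the only reliable evidence.

```cpp
#include <cstddef>

// Branching version: one conditional jump per element, which is costly
// when the data makes the outcome hard to predict.
std::size_t count_above_branchy(const int* values, std::size_t count,
                                int threshold) {
    std::size_t hits = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (values[i] > threshold) {
            ++hits;
        }
    }
    return hits;
}

// Branch-free formulation: the comparison yields 0 or 1, which is added
// unconditionally.
std::size_t count_above_branchless(const int* values, std::size_t count,
                                   int threshold) {
    std::size_t hits = 0;
    for (std::size_t i = 0; i < count; ++i) {
        hits += static_cast<std::size_t>(values[i] > threshold);
    }
    return hits;
}

// Select without an if: returns a when cond is true, b otherwise.
// mask is all ones when cond is true and all zeros when it is false.
unsigned select_u32(bool cond, unsigned a, unsigned b) {
    const unsigned mask = 0u - static_cast<unsigned>(cond);
    return (a & mask) | (b & ~mask);
}
```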
It cannot be stressed enough: profile heavily to ensure you are actually working on performance hotspots, and to check that any changes are actually benefiting performance.
On the Xbox 360, useful tools include the PIX CPU instruction trace feature, LibPMCPB counters, and state sampling via XbPerfView. Details on these tools are outside the scope of this lecture, but can be easily located on the web.
Other useful tools include SN's Tuner and Intel's VTune.
Whenever possible, record your assumptions about a piece of code, and verify them. Incorrect assumptions are a leading cause of misguided optimization, often with bad results (i.e. things get slower instead of faster).
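One way to record assumptions so that they are verified on every build is to state them as compile-time and run-time checks. The sketch below uses a hypothetical `Vec4` type and an `is_aligned` helper, both invented for illustration, and assumes a platform where `float` is 4 bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical SIMD-friendly vector type used for illustration.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

// Assumptions recorded where the compiler checks them on every build.
static_assert(sizeof(Vec4) == 16,
              "Vec4 is assumed to fill exactly one 128-bit register");
static_assert(alignof(Vec4) == 16,
              "Vec4 is assumed to be 16-byte aligned");

// Run-time check for assumptions that depend on the caller, such as the
// alignment of a buffer handed to an optimized routine.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```

An optimized routine could then start with something like `assert(is_aligned(buffer, 16));` so that a broken assumption fails loudly in debug builds instead of silently costing performance.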
Use inlining where possible, but as always, profile to ensure that the code size increase does not lead to a net performance loss.
On the 360, the __declspec(passinreg) directive can be used to pass function parameters in registers instead of on the stack, which saves some expensive memory accesses.
In C++, const correctness is important. It does not directly affect generated code, but it is invaluable for locating code that is accessing memory more heavily than it needs to.
Inline assembly almost always performs poorly, because it disables compiler optimizations. Compiler intrinsics avoid this issue.
Minimize the number of parameters passed around, and the number of hits to the stack. This usually has to be addressed at the algorithm design level and is very difficult to accomplish at the micro-optimization level.
Use the native data size as much as possible (e.g. stick to 32-, 64-, or 128-bit data). Use of smaller data elements (8 bits, 16 bits) requires extra CPU instructions and can incur quite a few penalties in the CPU pipeline.
Avoid virtual function calls when possible; again, this is largely a requirement at the algorithmic level.
Know your cache architecture, and design specifically for it. Prefetching is a great tool for performance improvements, but requires fairly involved knowledge of the cache design.
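A minimal prefetching sketch follows, under several assumptions: it uses `__builtin_prefetch`, which is a GCC/Clang extension (console toolchains expose their own equivalents, and other compilers get a no-op here), and the prefetch distance of 8 is a made-up tuning value that would need to be chosen by profiling on the target cache.

```cpp
#include <cstddef>

// __builtin_prefetch is a GCC/Clang extension; elsewhere this sketch
// compiles the prefetch to a no-op.
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(addr) __builtin_prefetch((addr), 0)
#else
#define PREFETCH_READ(addr) ((void)(addr))
#endif

// Hypothetical tuning value: how many iterations ahead to prefetch.
// The right distance depends on memory latency and per-iteration work.
const std::size_t kPrefetchDistance = 8;

// Sums table[indices[i]]. The indirect access is hard to predict, so the
// element needed a few iterations from now is requested early.
long long gather_sum(const int* table, const std::size_t* indices,
                     std::size_t count) {
    long long total = 0;
    std::size_t i = 0;

    // Main part: there is still an element kPrefetchDistance ahead.
    const std::size_t main_end =
        count > kPrefetchDistance ? count - kPrefetchDistance : 0;
    for (; i < main_end; ++i) {
        PREFETCH_READ(&table[indices[i + kPrefetchDistance]]);
        total += table[indices[i]];
    }

    // Tail: nothing left to prefetch.
    for (; i < count; ++i) {
        total += table[indices[i]];
    }
    return total;
}
```

The loop is split into a main part and a tail so that the prefetch does not add a bounds-check branch to every iteration. A prefetch issued too late or too early can cost more than it saves, which is why the distance is a tuning parameter.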
Avoid aliasing when possible. Also avoid switching register sets often (i.e. moving values between integer and floating point registers, or even SIMD registers).
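To illustrate the aliasing point, the sketch below contrasts a function where the compiler must assume its pointers may overlap with one that promises they do not. `__restrict` is a compiler extension (accepted by MSVC, GCC, and Clang) rather than standard C++, and the `NO_ALIAS` macro and function names are hypothetical.

```cpp
#include <cstddef>

// `__restrict` is a compiler extension, not standard C++; compilers that
// are not recognized here get an empty macro.
#if defined(_MSC_VER) || defined(__GNUC__) || defined(__clang__)
#define NO_ALIAS __restrict
#else
#define NO_ALIAS
#endif

// The compiler must assume that a write to out[i] could change *scale or
// later elements of in, so it may reload them on every iteration.
void scale_may_alias(float* out, const float* in, const float* scale,
                     std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = in[i] * *scale;
    }
}

// The caller promises that the three pointers refer to non-overlapping
// memory. Breaking that promise is undefined behavior.
void scale_no_alias(float* NO_ALIAS out, const float* NO_ALIAS in,
                    const float* NO_ALIAS scale, std::size_t count) {
    const float s = *scale;  // read once, kept in a register
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = in[i] * s;
    }
}
```

Copying `*scale` into a local is the portable half of the fix and works even where the extension is unavailable.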
It may be tempting to use an if to skip code in order to boost performance, but this can backfire: cache misses, branch misprediction, and load/hit/store penalties can all worsen the situation overall.
Do lots of regression testing, both to ensure that optimizations have not introduced bugs, and to ensure that "optimizations" are actually improving performance.
Never guess.
Don't assume that performance issues are localized to a single area; they may be caused by fairly long and sophisticated "chain reactions" in the code. Keep the larger picture of the code's behaviour in mind and avoid getting too bogged down in tiny details.
Conversion between data formats, such as integer/floating point casts, is murderously expensive.
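One way to keep conversions out of a hot loop is to carry a floating point copy of the value alongside the integer one. The sketch below is a hypothetical example of that idea; it relies on `float` representing integers exactly only up to 2^24, so it assumes `count` stays at or below 16777216.

```cpp
#include <cstddef>

// Converts the loop index from integer to float on every iteration.
void fill_ramp_converting(float* out, std::size_t count, float step) {
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = static_cast<float>(i) * step;
    }
}

// Keeps a float counter next to the integer index, so the loop body
// contains no integer-to-float conversion.
// Assumes count <= 16777216 (2^24), the range in which a float holds
// every integer exactly.
void fill_ramp_float_counter(float* out, std::size_t count, float step) {
    float fi = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = fi * step;
        fi += 1.0f;
    }
}
```

Whether the second form is faster depends on the target, since it trades a conversion for a floating point add, so it should be measured like any other change.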
Be careful with SIMD instructions; they may actually cause a net performance decrease due to unaligned loads, cache misses, and pipeline stalls.
Don't forget to use assertions and unit tests to ensure that optimizations are not breaking functionality.
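A simple pattern for this is to keep the straightforward version of a routine around as a reference and check the optimized version against it. The sketch below is one hypothetical arrangement; the function names and the hand-unrolled sum are illustrative only.

```cpp
#include <cstddef>
#include <vector>

// Straightforward version, kept as the reference for correctness.
long long sum_reference(const int* values, std::size_t count) {
    long long total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}

// Hand-unrolled version: four independent accumulators, then a tail loop
// for the remaining zero to three elements.
long long sum_unrolled(const int* values, std::size_t count) {
    long long a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        a0 += values[i];
        a1 += values[i + 1];
        a2 += values[i + 2];
        a3 += values[i + 3];
    }
    for (; i < count; ++i) {
        a0 += values[i];
    }
    return a0 + a1 + a2 + a3;
}

// Hypothetical self-check: compares both versions on every length from
// 0 to max_len, so the unrolled body and the tail are both exercised.
bool sums_agree(std::size_t max_len) {
    std::vector<int> values;
    for (std::size_t len = 0; len <= max_len; ++len) {
        if (sum_unrolled(values.data(), values.size()) !=
            sum_reference(values.data(), values.size())) {
            return false;
        }
        values.push_back(static_cast<int>(len * 7) - 20);
    }
    return true;
}
```

A unit test or debug-build assertion can then call something like `assert(sums_agree(64));` after every change to the optimized routine.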
Isolate code that makes lots of assumptions: hide it behind a solid interface, and then try to reduce the assumptions as much as possible.