Developing a modern game is often a multi-year process, involving vast amounts of programming, art and content creation, and so on. As if this weren't enough of a challenge by itself, hardware also continues to advance during development, so that by the time a game ships, commonly available commodity hardware is significantly more advanced than it was when development began.
Nowhere is this change as dramatic or rapid as with CPU advancement. CPU cores and architectures are changing radically, especially with new multi-core designs taking root in the market. An average video game may actually see the rise and fall of several different CPU architectures during its development lifetime. In this lightning-paced world, it can be extremely difficult to take advantage of all the hardware that is available - if for no other reason than the fact that you may be designing a game around hardware that will be obsolete by the time the title ships.
Tips and tricks for dealing with the pace of CPU change
CPUs are no longer fully interchangeable, especially with the rise of 64-bit processors. Therefore, it is vital to test code on as wide a variety of CPUs as possible.
As core counts continue to increase, knowing how to design and implement solid multithreaded code will become increasingly vital.
SIMD is universally available; there's no longer any reason not to take advantage of it. SSE2 is essentially available on all modern CPUs, and processors from the dual-core era onward support at least SSE3.
Knowing how to optimize cache usage is crucial; be careful not to evict or invalidate data that other threads are relying on in cache (for example, through false sharing of a cache line).
Communication that has to cross the front-side bus will be very slow; this includes talking between CPUs (on certain architectures), talking to hardware devices, and even certain memory access patterns.
Avoid using the heap when possible; prefer to use special memory pools (for cache locality and cheap allocations) or the stack (but avoid bad hacks like alloca)
In some situations where data must be queued, consider LIFO stacks instead of more traditional FIFO/queue structures, in order to improve cache locality
Employ data parallelism as much as possible; future CPU architectures will continue to improve the benefits of such approaches
Fine-grained tasks are often better than large, monolithic tasks, because they can be subdivided and spread across more CPU cores. However, beware of creating a scenario that involves huge amounts of locking to communicate between tasks, as this can completely destroy performance
Avoid sleeping a thread as much as possible, especially if doing so leaves one or more cores idle; this can create a situation where the core never actually sleeps (since there is nothing to yield to) and the cache is thoroughly thrashed by context switching.
Always profile your code, and in particular run regression tests after making an optimization to ensure that it is not in fact having a negative impact.
Detecting CPU topology is very difficult, but there are some useful techniques here
Setting core/CPU affinities is dangerous and tricky; avoid doing so wherever possible, or at the very least allow users to disable affinitization. Chances are the OS and CPU architecture will function better when left to distribute code across cores as they see fit.
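As an illustration of the memory pool and LIFO tips above, here is a minimal C++ sketch of a fixed-capacity object pool. The class name `FixedPool` and its `create`/`destroy`/`available` members are hypothetical names chosen for this example, not something taken from the talk. It assumes single-threaded use (there is no locking), a capacity chosen up front, and a compiler with C++17 support if `T` is over-aligned. All slots live in one contiguous buffer, and freed slots are handed back out in LIFO order, so the most recently released slot, which is the one most likely to still be in cache, is reused first.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Hypothetical fixed-capacity object pool (illustrative only).
// Storage is allocated once, up front, as a single contiguous buffer.
// Note: objects still alive when the pool is destroyed are NOT destructed;
// callers are expected to destroy() everything they create().
template <typename T>
class FixedPool {
public:
    explicit FixedPool(std::size_t capacity) : slots_(capacity) {
        free_.reserve(capacity);
        // Push in reverse so that the first allocation returns slot 0.
        for (std::size_t i = capacity; i > 0; --i) {
            free_.push_back(&slots_[i - 1]);
        }
    }

    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // Constructs a T in a free slot; returns nullptr if the pool is exhausted.
    template <typename... Args>
    T* create(Args&&... args) {
        if (free_.empty()) return nullptr;
        Slot* slot = free_.back();  // top of the LIFO free stack
        free_.pop_back();
        return new (slot->bytes) T(std::forward<Args>(args)...);
    }

    // Destroys the object and pushes its slot back on top of the free stack,
    // so it is the next slot to be reused.
    void destroy(T* object) {
        assert(object != nullptr);
        object->~T();
        free_.push_back(reinterpret_cast<Slot*>(object));
    }

    // Number of slots currently unused.
    std::size_t available() const { return free_.size(); }

private:
    struct Slot {
        alignas(T) unsigned char bytes[sizeof(T)];
    };

    std::vector<Slot> slots_;  // one contiguous allocation, made once
    std::vector<Slot*> free_;  // LIFO stack of unused slots
};
```

With a pool such as `FixedPool<int> pool(2)`, two calls to `create` succeed and a third returns `nullptr`; destroying the second object and calling `create` again returns the same address that was just released. Whether this actually beats the general-purpose heap depends on the allocation pattern and the target CPU, so it is worth profiling before and after adopting it.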
Additional Resources
Slides from the talk are available here.
Intel offers a comprehensive set of manuals for its processors here. Its Threading Building Blocks library also demonstrates some highly effective techniques for working with threaded code.