Let me start with multi-processing, or more accurately the challenges of parallelized programming. Current technology revolves around a model called Symmetric Multi-Processing, or SMP. SMP basically gives you a handful of identical processors, and lets you run code on them in parallel. In the early days, SMP meant literally having more than one CPU chip; nowadays, the trend is to cram the processors onto a single die, known as a multi-core architecture.
Intel's plans for the future involve what I've come to call core proliferation. The reason we are seeing such a big shift towards SMP, as Carmack outlined, is that we're close to the limit of what a single core can do. The logical place to continue getting speed boosts is to add more cores. Core proliferation, in a nutshell, is the tendency to continue adding cores as the primary means of increasing raw processing performance.
Core proliferation - and, really, the entire notion of SMP - has dire consequences for most programmers. (NB: I'm going to focus primarily on games here, partly because, well, this is GDNet, and partly because that's what I know best.) The bulk of performance-critical game code is written in C or C++ these days. The problem is, both languages really suck at writing parallelized code.
There are basically three main levels at which we can parallelize any given computing process. The first and highest level is separation of processes. An easily identifiable example of this is doing graphics on dedicated GPUs, and running game logic on the CPU. These are two different processes that can be done largely independently. Incidentally, this is also the same model that early multitasking operating systems used - divide things into (largely) unrelated processes, and each runs independently. When possible, they run in parallel. The "process" term is still in active use in OS kernels to discuss that level of organization.
The next level down is separation of steps, or (more familiarly) multithreading. Here, we have several steps in a single process that can be done independently. Each is split into a "thread" and shunted off onto any available processor. Threading is almost exclusively done on CPUs, although programmable shaders represent a sort of analogue. The important difference here is that threads are often deeply interrelated.
The final level is separation of operations, where we actually look at the individual atomic instructions that the code executes on the CPU, and do some magic to make them run in parallel. Hyperthreading works at roughly this level, as do the superscalar pipelines of any modern CPU. Back in the day, this was the tactic that Abrash and Carmack used in Quake to get performance gains.
Now, I'm abusing terminology a little bit. Most of the time, multithreaded software is not actually divided in the separation-of-steps plane. In fact, today the majority of multithreading is done to accomplish separation of processes. As an interesting side note, this has been the cause of a not-so-minor philosophical war between various flavours of Unix and Windows; Windows tends to prefer splitting processes rarely and threads often, whereas the traditional Unix approach is to split processes (in the OS sense of the word) rather than threads.
This isn't because people are belligerent schmucks who don't like using words the same way I do. It's because of the programming languages we use.
Multithreading is, these days, more or less synonymous with separation of processes. At Egosoft, as we've looked at updating our technology for taking advantage of SMP in the future, we've approached it from the perspective of dividing up threads (and, by extension, CPUs) by processes, rather than by steps in an individual process.
The reason for this is interesting and all-important: separation of processes requires far less synchronization than separation of steps.
Let me explain a bit what I mean by that. Consider two individual processes, rendering graphics and rendering audio. The graphics process involves doing a lot of transformations, lighting, shader code, and eventually pushing around pixels. Audio requires doing some physical simulation of the environment, controlling volume levels, doing mixing, and eventually pushing raw waveform data to the sound hardware.
These processes are highly independent; the only time they really need to "sync up" is to make sure a gunshot sound goes off at the same time as the muzzle flash is drawn on the screen and the bullet flies out on its way toward some hapless bastard's skull. Typically, these two processes are synchronized per-frame; that's the simplest way to effectively make sure they stay lined up. If graphics rendering takes longer (which it's almost guaranteed to do), the audio code waits around for a while; once they're both ready, they do the next frame's worth of work, and the cycle goes on.
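To make that per-frame sync concrete, here's a minimal C++ sketch. It is only an illustration under my own assumptions: FrameBarrier and runLockstep are hypothetical names, and the "work" each thread does is just bumping a frame counter. Two threads stand in for graphics and audio, and each one waits at the barrier until the other has finished the current frame.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical helper: a reusable barrier for a fixed number of threads.
class FrameBarrier {
public:
    explicit FrameBarrier(int parties) : parties_(parties) { assert(parties > 0); }

    // Blocks until all parties have arrived for the current frame.
    void arriveAndWait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const long myGeneration = generation_;
        if (++arrived_ == parties_) {
            // Last one in: reset the count and release everybody.
            arrived_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return generation_ != myGeneration; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    const int parties_;
    int arrived_ = 0;
    long generation_ = 0;
};

// Runs two lockstepped loops and returns the largest lead (in frames) that
// either thread ever saw the other one have, measured right after the sync.
int runLockstep(int frameCount) {
    std::atomic<int> graphicsFrames{0};
    std::atomic<int> audioFrames{0};
    FrameBarrier frameSync(2);
    int graphicsWorst = 0;
    int audioWorst = 0;

    auto loop = [&](std::atomic<int>& mine, std::atomic<int>& theirs, int& worst) {
        for (int frame = 1; frame <= frameCount; ++frame) {
            mine.store(frame);          // stand-in for this frame's real work
            frameSync.arriveAndWait();  // the faster side waits here
            worst = std::max(worst, theirs.load() - frame);
        }
    };

    std::thread gfx(loop, std::ref(graphicsFrames), std::ref(audioFrames),
                    std::ref(graphicsWorst));
    std::thread snd(loop, std::ref(audioFrames), std::ref(graphicsFrames),
                    std::ref(audioWorst));
    gfx.join();
    snd.join();
    return std::max(graphicsWorst, audioWorst);
}
```

The point of the exercise: there is exactly one sync point per frame, and neither side can ever get more than one frame ahead of the other. Everything else each thread does is its own business.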
By contrast, consider the act of splitting up AI into multiple threads. Suppose I've got fifty tanks all doing their thing in some virtual environment. They roam around and do stuff. Then, one tank decides to shoot at another - tank A wants to whack tank B. Unfortunately, tank B's position is currently being decided by AI thread number two, while tank A is being handled by thread number one.
What we have here is a classical synchronization problem. In order to know where to fire, tank A has to know tank B's position. But tank B is currently deciding where to move. Tank A has to wait until thread 2 finishes moving tank B, or risk a miss. However, what happens if tank A is the first tank handled by thread 1, and tank B is the last one handled by thread 2? Now thread 1 has to wait for all of thread 2 to finish running before it can start its first tank.
Now, there are some ways around this, obviously. The problem is, they are complicated, and bug-prone. I won't get into the technical side of how to solve that problem (it's actually one of the easier ones to solve), but that's a good taste of how complicated it is to split things up by steps rather than by processes. Believe me when I say it gets much worse than that - especially when you throw in the ultimate unknown: multiplayer.
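For the curious, here's a sketch of one common workaround, which may or may not be the one you'd pick in a real engine: double-buffering the world state. Every AI thread reads tank positions from last frame's snapshot and writes only its own tanks into the next one, so nobody waits on anybody. The Tank struct, thinkRange, and stepAI are all hypothetical names I'm making up for illustration, and the "AI" is a one-line stand-in.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical, heavily simplified tank: a 1-D position, a point it is
// aiming at, and the index of the tank it wants to whack.
struct Tank {
    float x;
    float aimX;
    std::size_t target;
};

// Thinks for tanks in [begin, end). Reads only `previous`, and writes only
// this range's slots in `next`, so ranges never touch each other's data.
void thinkRange(const std::vector<Tank>& previous, std::vector<Tank>& next,
                std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) {
        Tank t = previous[i];
        assert(t.target < previous.size());
        t.aimX = previous[t.target].x;  // aim at where the target was last frame
        t.x += 1.0f;                    // stand-in for real movement logic
        next[i] = t;
    }
}

// Advances every tank by one frame, splitting the work across threadCount threads.
std::vector<Tank> stepAI(const std::vector<Tank>& previous, unsigned threadCount) {
    if (threadCount == 0) threadCount = 1;
    std::vector<Tank> next(previous.size());
    std::vector<std::thread> workers;
    const std::size_t n = previous.size();
    for (unsigned w = 0; w < threadCount; ++w) {
        const std::size_t begin = n * w / threadCount;
        const std::size_t end = n * (w + 1) / threadCount;
        workers.emplace_back(thinkRange, std::cref(previous), std::ref(next), begin, end);
    }
    for (std::thread& worker : workers) worker.join();
    return next;
}
```

Notice the price: tank A now aims at where tank B was a frame ago, not where it is now. Whether that's acceptable is a design decision, and that kind of trade-off is exactly what makes separation of steps so much hairier than separation of processes.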
For now, the answer is easy: just perform separation of processes, stick one process on each CPU (and maybe combine some of the quicker processes onto a single CPU), and you're done. You can avoid too many sticky synchronization issues, and still get a benefit from SMP.
That's a great solution, and will remain the staple way that game developers take advantage of SMP - for a while.
But eventually, if the CPU manufacturers get their way, we will encounter core proliferation. What happens when you have 24 cores, but only 6 distinct processes? You're forced to either approach separation by steps, or waste 18 processors. But separation by steps is hard, and one of the biggest unexplored frontiers of software development today. The net result is that it will be hard to develop games - which, in turn, means development will get more expensive.
So are we all doomed to waiting 6 years and paying $90 for Halo 4? Don't panic just yet.
There are two main ways that we will solve the problem of core proliferation. The first, and most obvious, is to improve our tools. The main reason that multiprocessing is expensive is that C and C++ suck for synchronization. There have been better ways to write parallelized software for literally decades - in the realm of functional programming languages.
Ericsson actually realized this a number of years ago; they had a need for extremely parallel and highly reliable systems. The existing technologies wouldn't cut it. So they invented their own, which has gone on to enjoy quite a bit of cult success - you may have heard of Erlang.
Within the scope of the next 5 years, developers will begin leaving C and C++. In their heyday, those languages were favorable for one primary reason: they spoke a language very close to that of contemporary hardware. They worked well with low-level concepts like memory pointers and even assembly language. They closely mirrored the design structure of the hardware itself.
However, in the past several years, that has ceased to be the case. Processors are now flooded with bizarre prediction technologies and genuinely remarkable capabilities for doing things out of order (while still getting the right result). Caches, pipelines, and predictive systems are what have gained us the bulk of processor performance in the past 8 years. Carmack makes a vital (but brief) allusion to this early in his talk.
C and C++, by contrast, have not kept pace. They still represent anachronistic models of computing and processing, which no longer speak to the hardware. They do not intrinsically understand threading, multiprocessing, or even branch prediction. Individual compilers have done a fairly reasonable job of converting C/C++ code into assembly that is aware of those concepts; initiatives like OpenMP have helped with parallelism as well.
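Here's a small sketch of what that OpenMP-style bolt-on looks like in practice. The mixInto function is a hypothetical example of mine (mixing one audio source into a buffer); the `#pragma omp parallel for` directive itself is standard OpenMP. Built with OpenMP enabled (for instance `-fopenmp` on GCC or Clang), the loop iterations get divided among threads; built without it, the pragma is simply ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds `source`, scaled by `gain`, into `mix`, one sample at a time.
// Each iteration touches only its own index, so the iterations are
// independent - but it is the programmer, not the language, who vouches for that.
void mixInto(std::vector<float>& mix, const std::vector<float>& source, float gain) {
    assert(mix.size() == source.size());
    const long count = static_cast<long>(mix.size());

    #pragma omp parallel for
    for (long i = 0; i < count; ++i) {
        mix[i] += source[i] * gain;
    }
}
```

Note where the parallelism lives: in an annotation sitting on top of the language, not in the language itself. If the loop body had a hidden dependency between iterations, C++ would happily compile it anyway.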
But ultimately it will not be enough. C and C++ are dying, because their once great strength - closeness to hardware - is long gone. As the hardware continues to diverge from the outdated model of those languages, they will become progressively less relevant to mainstream development, particularly in performance-critical, close-to-the-system sectors like the game industry.
Functional languages will have their day in the sun - and it will begin in the next few years. This is virtually inevitable. The reason? Functional languages have the potential to represent the model of hardware much more closely. The benefits of functional programming for parallelism and massive concurrency are very well-understood, and well-discussed elsewhere.
However, note carefully that they have only the potential - IMHO, current functional languages aren't real great for practical development in many cases, and are very far from realizing their true potency.
Now it's shameless plug time: enter my little pet project, Epoch. The big struggle right now for Epoch is to find its killer trait, that one thing that will make it kick so much ass that it just totally wipes the floor with existing tools, and takes over the world. (I'll be honest - that's pretty much what I want to see it do.)
From the beginning one of my big emphases for the language has been the ability to think close to hardware. In my mind that is the single strength that is truly responsible for the success of C and C++ in the gaming sector, and I think that's the sector that Epoch has the best chance of succeeding in (due, in no small part, to the fact that that's where I personally can bring its influence to bear).
However, I think my original concept was off base. Initially, I thought it'd be another C - talking about memory pointers, able to call specific code addresses via function pointers, supporting inline assembly, and so on. Parallelism would be important, but I saw it as a mostly orthogonal concern. The more I look at it, though, the more I'm convinced that the opposite is in fact the case.
Allow me to paint for you a hypothetical picture. Imagine a processor which is not 16 or 32 general-purpose cores crammed together, but instead 64 or 128 highly specific mini-CPUs. Each pipeline is designed for a particular type of processing: one set of pipes may be really good at doing short tasks millions of times in a row. Another set might be good for deeply-pipelined operations like graphics processing, where a chunk of data has a lot of work done on it over a relatively long period of time, but massive globs of data are processed this way in parallel. A third might be really good at the stuff that we use SIMD instruction sets for today. Yet another set might excel at slow operations that rely on external input. Let's call all of this mini-pipelining.
So far what we've got is a classical example of ASMP, or asymmetric multi-processing, where each "node" in the set of processors is not necessarily identical to the others. Instead, each is highly specialized for a particular family of tasks - but not necessarily just one particular task, the way GPUs are currently.
Combine this with a memory architecture that can supply memory efficiently and at sufficient bandwidths to keep each mini-pipe happy. We have now eliminated processor stalling as much as possible.
At this point, we have one major question: how does one create software for this? Do we have to specifically address each mini-pipe in our code, the way we have to design for CPU-vs-GPU workloads now? (Or, for a more scary example, consider the way we have to code for processors such as Cell.) Or do we invent some kind of hyperthreading-analogue that intelligently splits code up and runs it on each mini-pipe as appropriate?
I think an HT-style approach is out, for one main reason: it would require knowledge of the code being executed on a relatively massive scale (i.e. potentially tens of thousands of instructions). While this is certainly possible to do in realtime (especially in the span of the next 5-10 years we're talking about here), it's stupid to do so - because the code is largely static. It doesn't need to get re-analyzed every time it gets run; it only needs to be analyzed once, each time it is compiled.
The natural solution is to use some kind of hinting, where the compiler emits a sort of meta-data markup on the code, telling the processor which pipe to run various tasks on. To some extent, appropriate pipes can be determined solely by which sorts of instructions the code uses, but ideally we'd see the instruction set look much more like RISC than the CISC-style we get currently. (That's a pretty involved discussion, though, so I'll gloss over it for now.)
So again, we have an important question: where does the compiler get these hints from? Does it generate them automatically from the code? Or does the programmer have a hand in adding them, à la OpenMP or threading?
The answer is nice and Zen-like: both, and neither.
Functional programming provides the perfect venue for exploring this possibility. First of all, it allows the compiler to make some very sophisticated deductions about the code being handled. Secondly, it allows the programmer to add subtle hints without expending extra effort.
Designing a multithreaded program is many times harder than designing a single-threaded one. Achieving comparable quality is hard enough; getting a genuine performance benefit is more difficult still. However, with a good, well-designed functional language, it should be possible to allow the programmer to provide subtle hints about the kinds of work he's doing, without expending extra effort. In fact, he may well save himself some time, too.
Consider a contrived and obviously overly-abstract example: if the kinds of operations done by tail-recursion can be shown (by the compiler) to belong on Minipipe A, whereas operations done by map on a certain type of data structure work best on Minipipe B, the programmer can interchange those methods (quite easily) to affect which minipipe runs a given chunk of code. If, by chance, the map approach is more sensible and leads to cleaner code, he gets a double bonus: the code is better, and it runs on pipe B, which makes it faster. (I will conjecture that this case will happen quite often, although I have nothing more than intuition to justify that statement for the time being.)
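To show what "interchanging those methods" might look like, here's the same trivial computation written both ways. I'm using C++ only because it's the language most of us have open right now; a functional language would say this far more cleanly. The function names are hypothetical, and the idea that a compiler would route one form to Minipipe A and the other to Minipipe B is pure speculation on my part - nothing here actually does that.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Formulation 1: tail recursion with an accumulator. The structure of the
// code says "one step after another": each call hands its result to the next.
std::vector<int> doubleAllRecursive(const std::vector<int>& in, std::size_t i = 0,
                                    std::vector<int> acc = {}) {
    assert(i <= in.size());
    if (i == in.size()) return acc;
    acc.push_back(in[i] * 2);
    return doubleAllRecursive(in, i + 1, std::move(acc));
}

// Formulation 2: map. The structure of the code says "every element is
// independent": the same function is applied to each item, in no particular order.
std::vector<int> doubleAllMap(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), [](int v) { return v * 2; });
    return out;
}
```

Both functions return the same result for the same input. The difference is in what the shape of the code tells the compiler about the work, and that shape is the "subtle hint" the programmer gets to choose for free.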
Now that's just the crude, rudimentary level of things. Chuck in a JIT-compiler and hardware-level profiling data, and think about what that might do to code. Combine all of that with a CPU that supports JIT-style execution as well as adaptively changing the pipes that run code - what I've been mentally referring to as "adaptive mini-pipelining."
Just chew on that one for a few minutes. I suspect you'll get as excited as I have over the huge potential that such an approach can afford.
So... there are my musings on the future, and where I believe things will end up. I may have the timescales off by a significant amount, but I think the trend towards ASMP-style processing is more or less inevitable. I know there are a lot of awfully definitive and sweeping statements made here, and only time will tell, but I think I've gotten a pretty decent handle on the state of affairs and its likely directions.
The question, of course, is what do you think?