Musings on the future, part 2

Published August 10, 2006
I'm finally going to get around to wrapping up my thoughts on programming languages and the future of processing hardware. Be sure to read part 1 first, or none of this will make sense. I'd also highly recommend Carmack's keynote speech for some additional background and validation of my ravings. In fact, I'll just flat out say that Carmack's speech is required viewing for really getting much out of this entry. So go burn a few hundred MB of bandwidth, and come back when you're done.


Ready? Good.



Let me start with multi-processing, or more accurately the challenges of parallelized programming. Current technology revolves around a model called Symmetric Multi-Processing, or SMP. SMP basically gives you a handful of identical processors, and lets you run code on them in parallel. In the early days, SMP meant literally having more than one CPU chip; nowadays, the trend is to cram the processors onto a single die, known as a multi-core architecture.

Intel's plans for the future involve what I've come to call core proliferation. The reason we are seeing such a big shift towards SMP, as Carmack outlined, is that we're close to the limit of what a single core can do. The logical place to continue getting speed boosts is to add more cores. Core proliferation, in a nutshell, is the tendency to continue adding cores as the primary means of increasing raw processing performance.


Core proliferation - and, really, the entire notion of SMP - has dire consequences for most programmers. (NB: I'm going to focus primarily on games here, partly because, well, this is GDNet, and partly because that's what I know best.) The bulk of performance-critical game code is written in C or C++ these days. The problem is, both languages really suck at writing parallelized code.

There are basically three main levels at which we can parallelize any given computing process. The first and highest level is separation of processes. An easily identifiable example of this is doing graphics on dedicated GPUs, and running game logic on the CPU. These are two different processes that can be done largely independently. Incidentally, this is also the same model that early multitasking operating systems used - divide things into (largely) unrelated processes, and each runs independently. When possible, they run in parallel. The "process" term is still in active use in OS kernels to discuss that level of organization.

The next level down is separation of steps, or (more familiarly) multithreading. Here, we have several steps in a single process that can be done independently. Each is split into a "thread" and shunted off onto any available processor. Threading is almost exclusively done on CPUs, although programmable shaders represent a sort of analogue. The important difference here is that threads are often deeply interrelated.

The final level is separation of operations, where we actually look at the individual atomic instructions that the code executes on the CPU, and do some magic to make them run in parallel. Hyperthreading is exactly that. Back in the day, it was also the tactic that Abrash and Carmack used in Quake to get their performance gains.


Now, I'm abusing terminology a little bit. Most of the time, multithreaded software is not actually divided in the separation-of-steps plane. In fact, today the majority of multithreading is done to accomplish separation of processes. As an interesting side note, this has been the cause of a not-so-minor philosophical war between various flavours of Unix and Windows; Windows tends to prefer splitting processes rarely and threads often, whereas the traditional Unix approach is to split processes (in the OS sense of the word) rather than threads.

This isn't because people are belligerent shmucks who don't like using words the same way I do. It's because of the programming languages we use.


Multithreading is, these days, more or less synonymous with separation of processes. At Egosoft, as we've looked at updating our technology for taking advantage of SMP in the future, we've approached it from the perspective of dividing up threads (and, by extension, CPUs) by processes, rather than by steps in an individual process.

The reason for this is interesting and all-important: separation of processes requires far less synchronization than separation of steps.


Let me explain a bit what I mean by that. Consider two individual processes, rendering graphics and rendering audio. The graphics process involves doing a lot of transformations, lighting, shader code, and eventually pushing around pixels. Audio requires doing some physical simulation of the environment, controlling volume levels, doing mixing, and eventually pushing raw waveform data to the sound hardware.

These processes are highly independent; the only time they really need to "sync up" is to make sure a gunshot sound goes off at the same time as the muzzle flash is drawn on the screen and the bullet flies out on its way toward some hapless bastard's skull. Typically, these two processes are synchronized per-frame; that's the simplest way to effectively make sure they stay lined up. If graphics rendering takes longer (which it's almost guaranteed to do), the audio code waits around for a while; once they're both ready, they do the next frame's worth of work, and the cycle goes on.
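To make that per-frame handshake concrete, here's a bare-bones sketch of the idea in plain C++ (hypothetical code, not taken from any real engine; it leans on C++20's std::barrier just to keep the example short):

```cpp
// Two largely independent processes - graphics and audio - that only touch
// each other at one well-defined point per frame. Hypothetical sketch.
#include <barrier>
#include <thread>

constexpr int kFrames = 1000;     // fixed frame count keeps the sketch simple
std::barrier frame_sync(2);       // two participants: graphics and audio

void render_graphics_frame() { /* transforms, lighting, shaders, pixels... */ }
void render_audio_frame()    { /* environment simulation, mixing, waveform out... */ }

void graphics_thread() {
    for (int frame = 0; frame < kFrames; ++frame) {
        render_graphics_frame();       // usually the slow one
        frame_sync.arrive_and_wait();  // wait until audio has finished this frame too
    }
}

void audio_thread() {
    for (int frame = 0; frame < kFrames; ++frame) {
        render_audio_frame();          // typically finishes early...
        frame_sync.arrive_and_wait();  // ...and idles here until graphics catches up
    }
}

int main() {
    std::thread gfx(graphics_thread);
    std::thread snd(audio_thread);
    gfx.join();
    snd.join();
}
```

The specific primitive doesn't matter; the point is that the only synchronization these two processes ever need is that single meeting point at the end of each frame.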


By contrast, consider the act of splitting up AI into multiple threads. Suppose I've got fifty tanks all doing their thing in some virtual environment. They roam around and do stuff. Then, one tank decides to shoot at another - tank A wants to whack tank B. Unfortunately, tank B's position is currently being decided by AI thread number two, while tank A is being handled by thread number one.

What we have here is a classical synchronization problem. In order to know where to fire, tank A has to know tank B's position. But tank B is currently deciding where to move. Tank A has to wait until thread 2 finishes moving tank B, or risk a miss. However, what happens if tank A is the first tank handled by thread 1, and tank B is the last one handled by thread 2? Now thread 1 has to wait for all of thread 2 to finish running before it can start its first tank.
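Here's roughly what that looks like in code - a deliberately naive sketch with hypothetical names, one mutex per tank:

```cpp
// Deliberately naive sketch of the tank A / tank B problem. Hypothetical code.
#include <mutex>

struct Vec3 { float x = 0, y = 0, z = 0; };

struct Tank {
    std::mutex lock;   // guards position (and whatever else the tank owns)
    Vec3 position;
};

// AI thread 2: decides where tank B moves this tick.
void update_movement(Tank& b) {
    std::lock_guard<std::mutex> guard(b.lock);
    // ... pathfinding, steering, writing b.position ...
}

// AI thread 1: tank A wants to shoot at tank B.
void update_targeting(Tank& a, Tank& b) {
    std::lock_guard<std::mutex> guard(b.lock);  // blocks until thread 2 is done with B
    Vec3 target = b.position;                   // now we have a consistent position to aim at
    (void)a; (void)target;                      // stand-ins for the actual aiming/firing code
}
```

Scale that up to fifty tanks with two threads walking the list in different orders, and thread 1 can end up stalled behind nearly all of thread 2's work - exactly the scenario above.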


Now, there are some ways around this, obviously. The problem is, they are complicated, and bug-prone. I won't get into the technical side of how to solve that problem (it's actually one of the easier ones to solve), but that's a good taste of how complicated it is to split things up by steps rather than by processes. Believe me when I say it gets much worse than that - especially when you throw in the ultimate unknown: multiplayer.


For now, the answer is easy: just perform separation of processes, stick one process on each CPU (and maybe combine some of the quicker processes onto a single CPU), and you're done. You avoid most of the sticky synchronization issues, and still get a benefit from SMP.

That's a great solution, and will remain the staple way that game developers take advantage of SMP - for a while.


But eventually, if the CPU manufacturers get their way, we will encounter core proliferation. What happens when you have 24 cores, but only 6 distinct processes? You're forced to either attempt separation by steps, or waste 18 cores. But separation by steps is hard, and one of the biggest unexplored frontiers of software development today. The net result is that it will be hard to develop games - which, in turn, means development will get more expensive.

So are we all doomed to waiting 6 years and paying $90 for Halo 4? Don't panic just yet.



There are two main ways that we will solve the problem of core proliferation. The first, and most obvious, is to improve our tools. The main reason that multiprocessing is expensive is that C and C++ suck for synchronization. There have been better ways to write parallelized software for literally decades - in the realm of functional programming languages.

Ericsson actually realized this a number of years ago; they had a need for extremely parallel and highly reliable systems. The existing technologies wouldn't cut it. So they invented their own, which has gone on to enjoy quite a bit of cult success - you may have heard of Erlang.


Within the scope of the next 5 years, developers will begin leaving C and C++. In their heyday, those languages were favored for one primary reason: they spoke a language very close to that of contemporary hardware. They worked well with low-level concepts like memory pointers and even assembly language. They closely mirrored the design structure of the hardware itself.

However, in the past several years, that has ceased to be the case. Processors are now flooded with bizarre prediction technologies and genuinely remarkable capabilities for doing things out of order (while still getting the right result). Caches, pipelines, and predictive systems are what have gained us the bulk of processor performance in the past 8 years. Carmack makes a vital (but brief) allusion to this early in his talk.

C and C++, by contrast, have not kept pace. They still represent anachronistic models of computing and processing, which no longer speak to the hardware. They do not intrinsically understand threading, multiprocessing, or even branch prediction. Individual compilers have done a fairly reasonable job of converting C/C++ code into assembly that is aware of those concepts; initiatives like OpenMP have helped with parallelism as well.
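For the record, the OpenMP flavour of help looks something like this - the parallelism isn't something the language itself understands, it's a pragma bolted on from the outside (a minimal sketch, not production code; compile with something like -fopenmp):

```cpp
// Minimal OpenMP sketch: the loop runs in parallel only because of the pragma.
// C++ itself knows nothing about it. Hypothetical example.
#include <cstddef>
#include <vector>

void scale_positions(std::vector<float>& xs, float factor) {
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(xs.size()); ++i) {
        xs[i] *= factor;   // iterations are independent, so splitting them up is safe
    }
}
```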


But ultimately it will not be enough. C and C++ are dying, because their once great strength - closeness to hardware - is long gone. As the hardware continues to diverge from the outdated model of those languages, they will become progressively less relevant to mainstream development, particularly in performance-critical, close-to-the-system sectors like the game industry.

Functional languages will have their day in the sun - and it will begin in the next few years. This is virtually inevitable. The reason? Functional languages have the potential to represent the model of hardware much more closely. The benefits of functional programming for parallelism and massive concurrency are very well-understood, and well-discussed elsewhere.

However, note carefully that they have only the potential - IMHO, current functional languages aren't real great for practical development in many cases, and are very far from realizing their true potency.



Now it's shameless plug time: enter my little pet project, Epoch. The big struggle right now for Epoch is to find its killer trait, that one thing that will make it kick so much ass that it just totally wipes the floor with existing tools, and takes over the world. (I'll be honest - that's pretty much what I want to see it do.)

From the beginning one of my big emphases for the language has been the ability to think close to hardware. In my mind that is the single strength that is truly responsible for the success of C and C++ in the gaming sector, and I think that's the sector that Epoch has the best chance of succeeding in (due, in no small part, to the fact that that's where I personally can bring its influence to bear).

However, I think my original concept was off base. Initially, I thought it'd be another C - talking about memory pointers, able to call specific code addresses via function pointers, supporting inline assembly, and so on. Parallelism would be important, but I saw it as a mostly orthogonal concern. The more I look at it, though, the more I'm convinced that the opposite is in fact the case.



Allow me to paint for you a hypothetical picture. Imagine a processor which is not 16 or 32 general-purpose cores crammed together, but instead 64 or 128 highly specific mini-CPUs. Each pipeline is designed for a particular type of processing: one set of pipes may be really good at doing short tasks millions of times in a row. Another set might be good for deeply-pipelined operations like graphics processing, where a chunk of data has a lot of work done on it over a relatively long period of time, but massive globs of data are processed this way in parallel. A third might be really good at the stuff that we use SIMD instruction sets for today. Yet another set might excel at slow operations that rely on external input. Let's call all of this mini-pipelining.

So far what we've got is a classical example of AMP, or asymmetric multi-processing, where each "node" in the set of processors is not necessarily identical to the others. Instead, each is highly specialized for a particular family of tasks - but not necessarily just one particular task, the way GPUs are currently.

Combine this with a memory architecture that can supply memory efficiently and at sufficient bandwidths to keep each mini-pipe happy. We have now eliminated processor stalling as much as possible.


At this point, we have one major question: how does one create software for this? Do we have to specifically address each mini-pipe in our code, the way we have to design for CPU-vs-GPU workloads now? (Or, for a more scary example, consider the way we have to code for processors such as Cell.) Or do we invent some kind of hyperthreading-analogue that intelligently splits code up and runs it on each micro-pipe as appropriate?


I think a HT-style approach is out, for one main reason: it would require knowledge of the code being executed on a relatively massive scale (i.e. potentially tens of thousands of instructions). While this is certainly possible to do in realtime (especially in the span of the next 5-10 years we're talking about here), it's stupid to do so - because the code is largely static. It doesn't need to get re-analyzed every time it gets run; it only needs to be analyzed once, each time it is compiled.

The natural solution is to use some kind of hinting, where the compiler emits a sort of meta-data markup on the code, telling the processor which pipe to run various tasks on. To some extent, appropriate pipes can be determined solely by which sorts of instructions the code uses, but ideally we'd see the instruction set look much more like RISC than the CISC-style we get currently. (That's a pretty involved discussion, though, so I'll gloss over it for now.)


So again, we have an important question: where does the compiler get these hints from? Does it generate them automatically from the code? Or does the programmer have a hand in adding them, a la OpenMP or threading?


The answer is nice and Zen-like: both, and neither.


Functional programming provides the perfect venue for exploring this possibility. First of all, it allows the compiler to make some very sophisticated deductions about the code being handled. Secondly, it allows the programmer to add subtle hints without expending extra effort.

Designing a multithreaded program is many times harder than designing a single-threaded one. Achieving comparable quality is hard enough; getting a genuine performance benefit is more difficult still. However, with a good, well-designed functional language, it should be possible to allow the programmer to provide subtle hints about the kinds of work he's doing, without expending extra effort. In fact, he may well save himself some time, too.


Consider a contrived and obviously overly-abstract example: if the kinds of operations done by tail-recursion can be shown (by the compiler) to belong on Minipipe A, whereas operations done by map on a certain type of data structure work best on Minipipe B, the programmer can interchange those methods (quite easily) to affect which minipipe runs a given chunk of code. If, by chance, the map approach is more sensible and leads to cleaner code, he gets a double bonus: the code is better, and it runs on pipe B, thus being more performant. (I will conjecture that this case will happen quite often, although I have nothing more than intuition to justify that statement for the time being.)
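There's obviously no minipipe hardware to try this on today, but the nearest existing analogue I can offer is the difference between a hand-rolled loop and a bulk "map"-style algorithm call in C++: the second form states what is being done to the whole collection, which is exactly the kind of information a smarter toolchain could use to decide where the work runs. A rough, hypothetical sketch (using the C++17 parallel algorithms as a stand-in):

```cpp
// Rough analogue of the "map vs. hand-written iteration" idea: the bulk form
// declares the whole operation at once, so the toolchain is free to spread it
// across cores. Hypothetical sketch using C++17 parallel algorithms.
#include <algorithm>
#include <cstddef>
#include <execution>
#include <vector>

// Some pure per-element operation (a stand-in for real work).
float brighten(float sample) { return sample * 1.1f; }

// Form 1: the intent is buried inside a sequential loop.
void brighten_loop(std::vector<float>& samples) {
    for (std::size_t i = 0; i < samples.size(); ++i)
        samples[i] = brighten(samples[i]);
}

// Form 2: a "map" - the same work expressed as one bulk operation,
// which the library is permitted to spread across however many cores exist.
void brighten_map(std::vector<float>& samples) {
    std::transform(std::execution::par,
                   samples.begin(), samples.end(), samples.begin(),
                   brighten);
}
```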


Now that's just the crude, rudimentary level of things. Chuck in a JIT-compiler and hardware-level profiling data, and think about what that might do to code. Combine all of that with a CPU that supports JIT-style execution as well as adaptively changing the pipes that run code - what I've been mentally referring to as "adaptive mini-pipelining."

Just chew on that one for a few minutes. I suspect you'll get as excited as I have over the huge potential that such an approach can afford.



So... there's my musings on the future, and where I believe things will end up. I may have the timescales off by a significant amount, but I think the trend towards AMP-style processing is more or less inevitable. I know there are a lot of awfully definitive and sweeping statements made here, and only time will tell, but I think I've gotten a pretty decent handle on the state of affairs and its likely directions.

The question, of course, is what do you think?

Comments

Rebooted
Erlang's success isn't exactly cult: it's by far the most successful FP language, and has wide usage in the industry it was designed for.
Quote:However, note carefully that they have only the potential - IMHO, current functional languages aren't real great for practical development in many cases, and are very far from realizing their true potency.
I'm fairly sure OCaml could comfortably replace C++ today. It's lacking extensive library support, but language-wise, it's pretty astonishing how well equipped it is. Obviously it can evolve, and it does: Acute, MetaOCaml, etc. are all OCaml extensions at the cutting edge of areas of programming language research. Unfortunately ML and Haskell are left out in the cold while variants of the same imperative untyped language are worshipped. Haskell, ML and Scheme are better languages than Perl, Python or Ruby, but they lack the same buzz around them - and that's all that matters, it seems. It brings me physical pain every time I see a programming language topic somewhere like Digg - full of "programmers" who really don't know what they're talking about but dictate what gets acceptance and adoption.
August 10, 2006 06:13 PM
ApochPiQ
I guess "cult" really isn't the right word - maybe "niche." It's definitely a success, but it's far from a mainstream programming language. I'd lay odds that there are a significant number of people who would recognize the name "C++" but have never even heard of - let alone coded in - Erlang.

OCaml is a nice language; I have no doubts that based purely on merit as a language it could obliterate C++. The problem is the libraries, and libraries only happen after significant adoption occurs (or so goes my theory of Language Acceptance). If there were a reliable library for realtime graphics and a decent deployment mechanism for OCaml, I'd probably be writing games in it already. As it stands, .Net has far better deployment prospects, but still is far from being feasible as a mass-market development platform. Kind of telling, I think. [NB: not deployment only in the sense of end-user installation, but the entire production pipeline. The lack of tools on par with Visual Studio, vTune, CodeAnalyst, et al. really hurts, I think.]


Obviously what makes languages succeed today has vanishingly little to do with their genuine merits; otherwise we'd all have grown up on Lisp machines and Stroustrup wouldn't be a household name (among geek households, anyways). That's why my emphasis on Epoch has been primarily on how to get it adopted as a language, with less of a concern about the actual design of the language itself - which may seem like a paradoxically backwards approach, but I think may end up proving to be an important distinction from traditional "here's a cool language... now why the hell won't anyone use it?" scenarios.
August 10, 2006 06:24 PM
PumpkinPieman
Please bear in mind that I don't know much about this topic.

You said that some compilers have taken advances in technology into account and try to adapt the code accordingly. Why wouldn't compilers just naturally evolve instead of scrapping such an extensive language like C/C++?
August 11, 2006 09:20 AM
ApochPiQ
The problem has to do with the nature of the optimizations.

Some things are just plain outside the scope of the language to properly express. A good example of a borderline case is SIMD instruction sets; via the use of compiler-intrinsics, it is possible to enable the compiler to generate good SIMD-aware code. However, it isn't generally possible for the compiler to do so without the use of intrinsics in the first place.
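To illustrate (a quick, hypothetical sketch - a plain scalar loop next to the same thing spelled out with SSE intrinsics):

```cpp
// The scalar version may or may not get auto-vectorized; the intrinsic version
// states the SIMD explicitly, four floats at a time. Hypothetical sketch.
#include <xmmintrin.h>  // SSE

void add_scalar(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

void add_sse(const float* a, const float* b, float* out, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)          // scalar tail for the leftovers
        out[i] = a[i] + b[i];
}
```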

Most of the optimizations available in compilers today are optimizations that can be made purely on the output of the compilation process, i.e. the compiler can do the optimization without the original code needing any changes at all. Tweaking generated code to handle branch prediction is pretty easy, for instance, because it can all be done once the code is inside the guts of the compiler. It's low-level and "small" to a certain extent.

On the flip side, consider something huge-scale like running multiple threads - there's just no way for the compiler to know the best way to divide up your program into threads. To get compiled code that is threading-aware, you have to use libraries or extensions to the language, such as OpenMP. Since the stuff is "above" the assembly level - i.e. it affects things on a grander scale than just choosing what machine opcodes to emit - it is vastly more difficult to effectively optimize in the compiler itself.


It's certainly possible to bolt on enough extra baggage to C++ that it could cope with such things. It's even conceivable that an OpenMP-style hinting markup could be created specifically to target AMP-style hardware without losing C++. Frankly, though, the language is already enough of a ravenous beast as it is; in the long term, eliminating it is actually a net gain, not a net loss.


For a close-to-home example that's already played out very nicely: think about what Java did to pointers. That approach has also been tremendously successful in .Net languages. Sure, we have compiler options now that can help to begin to detect exploits and bugs caused by pointer mismanagement - but they can only scratch the surface. By contrast, use a memory-managed language like Java or C#, and those problems are virtually eradicated.

That's the same reason why some new language is eventually going to come along and crush C/C++ in the field of parallelized programming. In fact, plenty of people would say that superior languages already exist, and have for some time; the trick now is getting mass adoption. That may take time, but it's more or less inevitable as the woeful inadequacies of C and C++ to express high-level concepts become more serious issues for real-world development.
August 11, 2006 09:55 AM
Ravuya
My firm opinion is that while Erlang and Lisp and Scheme are all excellent languages - I love using them, and they are eminently usable for videogame construction (you can see some MMOs coming up in Erlang right now for their massive parallelization capabilities) - your average office developer is still struggling with loops, and when they try parallelized code, they end up with this. Could you imagine forcing your average MS Access tard to write Erlang functions? Could you imagine forcing a web designer to write them?

People have been saying C++ will die for more than 10 years now. It hasn't (and neither has C). Why? I dunno. I suspect it may begin dying out now that academia has completely wussed out and started shoving Java down your throat as the answer to every problem you may have; up until last year my university didn't require you to write a single line of C to walk out of it with a Ph.D. in computer science. I know that C++0x is getting much nicer threading support, but I don't know if it will launch before 2009, and I'm not even sure we'll still be living in cities as opposed to radiation-bathed Bartertown at that point.

I think the big gains are going to come from automatic parallelization; Intel has a shitload of projects on the go right now to do automatic parallelization at the runtime and compile levels, and I think they could honestly make some major in-roads here.

In the same way that memory debuggers like TotalView and libgmalloc help eliminate pointer fuckery (I still don't know why MSVS2005 doesn't ship with a memory page guard in all versions), I think that after enough years of throwing money at the problem we'll have some automatic parallelization utilities which are not only effective, but practically integrated into most imperative tools. Hell, TotalView already has a multithreaded (and network distributed) debugger which is fucking excellent to use. Apple's new Xray manages memory and multithreaded debugging quite excellently, it appears, and is receiving constant improvements from Intel's parallelization software group. Apple Shark already offers excellent suggestions on how to use SIMD in your applications, and will tell you how to multithread your code in the next release. The tools are coming.

It's always a good idea to use the hardware for maximum benefit; that's why I'm trying to write multithreaded game code now for my new game engine, but I don't think your average wageslave developer is going to be able to do it.
August 11, 2006 10:31 AM
ApochPiQ
I think a subtle benefit of a new language platform vs. new tools on top of C/C++ is that the "average wageslave" doesn't have to understand parallelism; the language is inherently well-suited to automating that side of the job.

Again, Java is a shining example here; maybe the "average codemonkey" can't get pointers right, and maybe he can, but Java just totally eliminates the entire question and saves us all a bunch of time and annoyance. (Well, except for the fact that Java sucks. Maybe I should be using C# in my examples...)


I have no doubt that tools can and will appear to help the parallelization problem in C/C++ - but that's not because it's a good idea. It's because those languages are so damn entrenched in the industry that many programmers don't know that they're making a mistake. Therein lies the main difficulty.

Replacing C++ won't be nearly so much about theoretical benefits (otherwise Lisp and OCaml would be the only things we code in anymore) but rather about evangelization and marketing. The battle isn't to create a technologically superior language; the battle is to create a sexier, more attractive language that just happens to sneak in some technical superiority on the side.


But in any case it may be moot in a few years; just as many C programmers jumped ship to Java to escape pointers, we may well realize (on a large, general-public scale) in a few years' time that synchronization by hand is dumb, and have another mass exodus to Foo Language of the Future. Naturally there will always be the masochists and macho types that write in C because it makes them feel studly (ignoring the obvious niches where C really is the best language to use). But that's nothing new.


Just another sort of half-thought to tack on here: another reason that automated parallelization in C/C++ is inferior is that it's much more static. Since automated parallelization requires access to the source code, it is much less practical to do JIT-style adaptive tweaking. However, in a functional language with suitably constructed runtimes, we can not only parallelize programs automatically, but we can do it dynamically in realtime to ensure that we always hit optimal hardware usage.

That will most likely take on the order of 10-20 years to appear for imperative-oriented technology, because it's extremely hard to do without certain provable guarantees about the code. In theory, functional languages could have been doing it for years (and may well have, and I'm just not aware of it).
August 11, 2006 04:27 PM
Rebooted
Quote:Original post by ApochPiQ
However, in a functional language with suitably constructed runtimes, we can not only parallelize programs automatically, but we can do it dynamically in realtime to ensure that we always hit optimal hardware usage.

That will most likely take on the order of 10-20 years to appear for imperative-oriented technology, because it's extremely hard to do without certain provable guarantees about the code. In theory, functional languages could have been doing it for years (and may well have, and I'm just not aware of it).
Erlang does it.

The various tools trying to do this for C++ won't be able to compete. Automatic parallelism relies on functions being purely functional - order of evaluation cannot matter. C++ makes this incredibly hard - side effects unnecessarily litter code, complicating not just parallelism issues, but the program as a whole. Compare this to Haskell for example, where the majority of code is pure, and the type system carries information about side effects.
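A contrived C++ illustration of why order of evaluation matters (hypothetical code):

```cpp
// Why side effects block automatic parallelism: the first version's result
// depends on the order (and interleaving) of calls; the pure version does not.
// Hypothetical example.
#include <vector>

int g_calls = 0;  // shared mutable state

int score_with_effect(int x) {
    ++g_calls;              // side effect: evaluation order now changes the answer
    return x * x + g_calls;
}

int score_pure(int x) {
    return x * x;           // pure: any order, any number of threads, same result
}

void score_all(std::vector<int>& xs) {
    for (int& x : xs)
        x = score_with_effect(x);  // a compiler cannot safely run these in parallel
    // With score_pure, every call is independent and the loop could be divided
    // across cores automatically - but nothing in C++ tells the compiler that.
}
```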
August 12, 2006 05:14 AM
Daerax
It seems I was able to nail many of the main points in my guesses on your last post - parallelization, concurrency, functional languages and moving hardware + software techniques.

I imagine that programmers will have to develop a new paradigm in which they consider how best to organize their code in a manner such that task distribution would be most communist. I imagine too that such things as terminating loops could be intelligently parallelized by the compiler.

Don't forget that one thing that will make your functional language most powerful in terms of code knowledge for manipulation is static and strong typing.
August 12, 2006 05:28 AM
Rebooted
Quote:Original post by Daerax
Don't forget that one thing that will make your functional language most powerful in terms of code knowledge for manipulation is static and strong typing.
Yes. One of the things that annoys me most is that people seem to think the fact that Python, etc lack static type systems is somehow an advantage.
August 12, 2006 06:06 AM
bytecoder
Quote:Original post by Rebooted
Quote:Original post by Daerax
Don't forget that one thing that will make your functional language most powerful in terms of code knowledge for manipulation is static and strong typing.
Yes. One of the things that annoys me most is that people seem to think the fact that Python, etc lack static type systems is somehow an advantage.

That's a very subjective statement. I could go into a long rant about how an ideal static type system offers no benefits over an ideal dynamic system, but I'll keep my mouth shut out of respect for Apoch.

edit:
I cannot argue for Python's type system, though. The fact that it has no notion of type annotations is really a turn-off when working with it.
August 12, 2006 02:40 PM