[C#/C++] Multithreading

16 comments, last by Xai 11 years, 2 months ago

Also worth noting that a bunch of Eve's server-side code is written in Stackless Python, and Python in general is a nightmare to multi-thread.

(there is a little doohickey called the Global Interpreter Lock, which throws a great big wrench in the works)

Good to know.

Although I've never used stackless, I was under the (obviously mistaken) impression that easier multi-threading was one of the benefits.

Guess I was wrong!

if you think programming is like sex, you probably haven't done much of either. - capn_midnight

Although I've never used stackless, I was under the (obviously mistaken) impression that easier multi-threading was one of the benefits.

Cooperative multi-tasking, yes. Threading no.

Cooperative multitasking-based languages like Stackless Python, Erlang, Scala, and Google's Go support a very different model of concurrency from the C/C++/C#/Java threading model - it's worth reading up on if you are interested.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Multithreading is often misunderstood, even among devs. Multithreading is primarily used to run things in parallel, not to speed things up. For example, in games multithreading is ideal for keeping your game responsive while the game is loading resources (for the next area), the user is providing input, or the AI is calculating (re)actions.

Yes, you can achieve speed-ups with MT, and MT is often used for speed-ups, for example for rendering in suites like 3ds Max or Maya. But your problem must be suited to running in parallel, and in most cases the speed-up is far from linear. With a perfect linear speed-up you would potentially gain 300% performance on a quad-core, which seems huge. But a linear speed-up is unrealistic. You have to coordinate (Mutex, MVar, synchronize, STM) the different processes or threads at their meeting points, and that results in a slowdown. It's utopian to expect a whole game to gain a 300% speed-up; even +100% is far from reality. In most cases you will solve specific sub-problems with MT or, and this is the most common way, you decouple sub-systems from each other so that each runs in parallel on its own processing unit.
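To put rough numbers on "far from linear", here is a small sketch using Amdahl's law. The post doesn't name it; it's brought in here as the usual textbook model, and `amdahlSpeedup` and the example fractions are made up for illustration. It also ignores the cost of the synchronisation itself, so real results would typically be worse.

```cpp
#include <cassert>

// Amdahl's law: speed-up on `cores` cores when only `parallelFraction`
// (0.0 to 1.0) of the single-core run time can be spread across cores.
// The remainder stays serial (sync points, work that can't be split...).
double amdahlSpeedup(double parallelFraction, int cores) {
    assert(cores >= 1 && parallelFraction >= 0.0 && parallelFraction <= 1.0);
    const double serialFraction = 1.0 - parallelFraction;
    return 1.0 / (serialFraction + parallelFraction / cores);
}
```

On a quad-core, `amdahlSpeedup(1.0, 4)` gives 4x (the "+300%" case), while a frame that is only half parallelizable, `amdahlSpeedup(0.5, 4)`, gives 1.6x.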

MT is often a trade-off. MT will make your project much more complex. More complexity will make your project more error-prone and will slow down the progress of the whole project. Your code-base becomes more fragile and "uglified". What's the benefit? More responsiveness, that's fine! A 10%-30% "speed-up", maybe not worth it.

I strongly suspect the EVE Online client makes use of parallelism, but not in the way gamers expect. The gamer takes a look at the task manager and complains that only a single CPU core is being used. But in which situation? Maybe that situation is not really parallelizable. For example, pumping data to the graphics card does not parallelize well, and sometimes not at all. Let me speculate: EVE Online parallelizes the client view, the network and resource loading. In a huge fleet fight, when everything is loaded on the client side, the bottleneck will be the rendering. And when the CPU part of the rendering doesn't parallelize well, CPU usage drops to just the one core doing the rendering.

Multithreading is often misunderstood, even among devs. Multithreading is primarily used to run things in parallel, not to speed things up. For example, in games multithreading is ideal for keeping your game responsive while the game is loading resources (for the next area), the user is providing input, or the AI is calculating (re)actions.

Yes, you can achieve speed-ups with MT, and MT is often used for speed-ups, for example for rendering in suites like 3ds Max or Maya. But your problem must be suited to running in parallel, and in most cases the speed-up is far from linear. With a perfect linear speed-up you would potentially gain 300% performance on a quad-core, which seems huge. But a linear speed-up is unrealistic. You have to coordinate (Mutex, MVar, synchronize, STM) the different processes or threads at their meeting points, and that results in a slowdown. It's utopian to expect a whole game to gain a 300% speed-up; even +100% is far from reality. In most cases you will solve specific sub-problems with MT or, and this is the most common way, you decouple sub-systems from each other so that each runs in parallel on its own processing unit.

I couldn't disagree with this more. This may be true for typical GUI-tools, but not games. Games are (soft) real-time applications, meaning you've got to hit a fixed time budget per frame, consistently.

When you're making a GUI-tool, you need the GUI part to remain at an "interactive" level of responsiveness (not real-time), while you do some heavy processing over a long period of time in the background. Threads are a very convenient way to achieve this -- if you put the GUI in one, and the heavy processing in another, then the OS will ensure that each of them obtains some amount of CPU time every so often (by default on Windows: one 15ms time slice at least once every 5 seconds).

Using this same approach in a real-time application is harmful. For example, say that we're on a single-core CPU, and when we load a file into RAM we've then got to run an LZMA decompression step on the loaded data, which takes a total of 1 second. You don't want this to affect the progress of the game's 'main thread' and impact the frame-rate.

Approach 1) We put the decompression code into a separate background thread, which sleeps unless it has work to do. When it does have work to do, we're relying on the OS's thread scheduler to choose which thread is running on the single CPU core. By default on Windows, the scheduler granularity is 15ms, so the decompression thread will require 67 time-slices (1000ms / 15ms, rounded up) to complete its 1 second task. If our main thread is attempting to run at a fixed real-time frame-rate of 60Hz (a limit of 16.6ms per frame), then during the time that the decompression thread is awake, this is now impossible (unless your 'main thread' only has 1.6ms of work to do per frame, i.e. 16.6ms minus one 15ms slice). From time to time (unpredictably), the main thread will be put to sleep for an entire 15ms time-slice (or maybe multiple time-slices).

That kind of unpredictability is simply not acceptable to a real-time application.

Approach 2) We manually time-slice the decompression code, so that after it's run for ~1ms (or some other chosen threshold), it stores its state and returns/yields -- a.k.a. cooperative multi-tasking. We run the decompression code on the "main thread" every frame, knowing that the biggest interruption that this task can have is a very predictable 1ms per frame.
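A minimal sketch of the manual time-slicing in Approach 2, assuming a much simpler stand-in for the LZMA step (a run-length decoder). `SlicedRleDecoder`, `step` and the (count, value) input format are made up for illustration; the only point is that the task keeps its progress in members and hands control back once its budget is spent.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the decompression step: a run-length decoder
// over (count, value) byte pairs that can stop part-way and resume later.
// A trailing unpaired byte is ignored.
class SlicedRleDecoder {
public:
    explicit SlicedRleDecoder(std::vector<std::uint8_t> input)
        : input_(std::move(input)) {}

    // Decode until the input is exhausted or `budget` has elapsed.
    // At least one pair is decoded per call, so progress is guaranteed.
    // Returns true once everything has been decoded.
    bool step(std::chrono::microseconds budget) {
        assert(budget.count() >= 0);
        const auto deadline = std::chrono::steady_clock::now() + budget;
        while (!done()) {
            const std::size_t count = input_[pos_];
            const std::uint8_t value = input_[pos_ + 1];
            output_.insert(output_.end(), count, value);
            pos_ += 2;  // progress is stored here, not on the call stack
            // Yield back to the caller when the time slice is used up.
            if (std::chrono::steady_clock::now() >= deadline) break;
        }
        return done();
    }

    bool done() const { return pos_ + 1 >= input_.size(); }
    const std::vector<std::uint8_t>& output() const { return output_; }

private:
    std::vector<std::uint8_t> input_;
    std::vector<std::uint8_t> output_;
    std::size_t pos_ = 0;
};
```

The main loop would call `decoder.step(std::chrono::microseconds(1000))` once per frame until it returns true. The budget is only checked between pairs, so a slice can overrun by the cost of one pair; a real decoder would pick its check points to keep that overrun small.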

As swiftcoder mentioned above, many "scripting" languages only provide these kinds of "cooperative multi-tasking threads" (often called Fibers in C++), instead of OS-level threads, and their entire purpose is to allow for concurrency of tasks.

On the other hand, OS-level threads should only be used in order to take advantage of hardware-level threads, which is only useful for gaining extra computational power. Using OS-threads for anything other than gaining access to extra hardware, in a real-time application, is an abuse of them. The exception to this is when interacting with legacy APIs that have long-blocking functions, which force you to put them into a thread.

n.b. file loading and user input aren't in this category -- your OS provides (non-blocking) asynchronous methods for these.

Post-load resource processing, and AI processing can both be time-sliced, but may also be multi-threaded if they're processor intensive.

MT is often a trade-off. MT will make your project much more complex. More complexity will make your project more error-prone and will slow down the progress of the whole project. Your code-base becomes more fragile and "uglified". What's the benefit? More responsiveness, that's fine! A 10%-30% "speed-up", maybe not worth it.

That entirely depends on the MT strategy that you choose. Many job-based strategies end up producing code that's simpler than typical C++ OOP code...

Approach 1) We put the decompression code into a separate background thread, which sleeps unless it has work to do. When it does have work to do, we're relying on the OS's thread scheduler to choose which thread is running on the single CPU core. By default on Windows, the scheduler granularity is 15ms, so the decompression thread will require 67 time-slices to complete its 1 second task. If our main thread is attempting to run at a fixed real-time frame-rate of 60Hz, then during the time that the decompression thread is awake, this is now impossible. From time to time (unpredictably), the main thread will be put to sleep for an entire 15ms time-slice (or maybe multiple time-slices).
That kind of unpredictability is simply not acceptable to a real-time application.

Approach 2) We manually time-slice the decompression code, so that after it's run for ~1ms (or some other chosen threshold), it stores its state and returns/yields -- a.k.a. cooperative multi-tasking. We run the decompression code on the "main thread" every frame, knowing that the biggest interruption that this task can have is a very predictable 1ms per frame.

I guess I expressed myself badly.

I've tried to outline this dilemma and misunderstanding. My statement was meant to be: you'd better not use any MT approach to speed up your application, regardless of the core count. You use MT to run things at the same time (for games, in the same frame). That holds regardless of which high- or low-level approach you choose. I completely agree with you: approach 1 is the worst case for a single core and approach 2 is more predictable, yes. But these approaches differ "only" in the details of the level they work at (which is not unimportant and will have a deep impact, indeed). You showed that it's sometimes better for the application to manage its (time) resources on its own. But this adds complexity to the project and shouldn't be underestimated (for example: you will lose determinism).

And again, even (or especially) for games, you don't choose an MT approach to make the game perform better. If a game dev thinks "uhm, my performance is too bad, let's switch MT on, I hope it will get better", that's the wrong motivation for MT. The best motivation for using any low- or high-level MT approach is to let things happen in parallel. For example: seamless environment streaming. In fact, if you choose approach 2 (aka high-level MT), you will lose performance if you measure performance by FPS count, which is not a good performance metric, but that's another topic.

My statement was meant to be: you'd better not use any MT approach to speed up your application, regardless of the core count.

And my response was the opposite -- the only reason to use multiple threads is to gain access to extra cores, in order to speed up the application.

Concurrency (as in, interleaving two different tasks) is irrelevant -- use coroutines or fibres or manual time-slicing for that kind of concurrency. Use threads to run code on more physical cores. Ideally, your thread count matches your CPU core count, no matter how many 'concurrent' systems you have.

Ideally, a game running on a single-core CPU would only have 1 thread, and a game running on a quad core would have exactly 4 threads. The game should be able to split its workload amongst the available pool of threads automatically, and when running on the quad-core, it should be almost 4x faster than when running on a single-core. That's the ideal result, and it's not impossible.
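As a rough sketch of "thread count matches core count": the code below sizes its worker set from `std::thread::hardware_concurrency()` and splits one workload across it. `parallelSum` and `workerCount` are illustrative names, the summing is a placeholder for real per-frame work, and a real engine would presumably keep a persistent pool rather than spawning threads per call.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// One worker per hardware thread. hardware_concurrency() may return 0 when
// the count can't be determined, so fall back to a single worker.
unsigned workerCount() {
    return std::max(1u, std::thread::hardware_concurrency());
}

// Sum `data` using `workers` threads. Each worker reads its own slice and
// writes only to its own slot in `partial`, so no locks are needed.
long long parallelSum(const std::vector<int>& data, unsigned workers) {
    assert(workers >= 1);
    std::vector<long long> partial(workers, 0);
    std::vector<std::thread> pool;
    const std::size_t chunk = (data.size() + workers - 1) / workers;

    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(data.size(), w * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        pool.emplace_back([&data, &partial, w, begin, end] {
            partial[w] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (std::thread& t : pool) t.join();

    // Combine in a fixed order so the result doesn't depend on timing.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Calling `parallelSum(data, workerCount())` uses 1 thread on a single-core machine and 4 on a quad-core without any change to the calling code, and `parallelSum(data, 1)` gives the same answer for debugging.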

But this adds complexity to the project and shouldn't be underestimated (for example: you will lose determinism).

There's no reason that multi-threaded programs have to give up determinism! Multi-threading strategies that introduce indeterminate behaviour are, IMHO, bad strategies in general (they may have niche applications).

One of the first models of computer that you're taught as a student is input->process->output. You've got some blob of input data, you feed it into some kind of process, and you get some blob of output data. You can then chain sequences of these blocks together in order to create an entire program. At the heart of everything that we do, this model is still relevant.

If you take all the chained IPO blocks that make up one frame of processing in your game, you've got a DAG of processes that need to be run, with dependencies between them (if the input to process #2 is the output of process #1, then process #1 must be complete before running process #2). You can perform a topological sort on this graph to get a linear order of processes, and every process that ends up being sorted to the same 'level' can be run in parallel (across multiple cores) without further synchronisation. This is how many functional languages can take any old program and "automatically multi-thread" it, while maintaining perfectly deterministic behaviour.
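The levelling step can be sketched like this, assuming processes are numbered 0..count-1 and dependencies are given as (producer, consumer) pairs. `topologicalLevels` is an illustrative name and the layering uses Kahn's algorithm, which is one common way to do the sort, not necessarily what any particular engine does. The sketch only computes the levels; dispatching each level to worker threads is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Group the processes of a dependency graph into levels. Everything inside
// one level is independent of everything else in that level, so a level can
// be spread over worker threads and joined before the next level starts.
// If the graph has a cycle, the processes on it never appear in any level.
std::vector<std::vector<int>> topologicalLevels(
        int count, const std::vector<std::pair<int, int>>& edges) {
    std::vector<std::vector<int>> consumers(count);
    std::vector<int> pendingInputs(count, 0);
    for (const auto& e : edges) {
        assert(e.first >= 0 && e.first < count);
        assert(e.second >= 0 && e.second < count);
        consumers[e.first].push_back(e.second);
        ++pendingInputs[e.second];
    }

    // Level 0: processes that depend on nothing.
    std::vector<int> current;
    for (int p = 0; p < count; ++p)
        if (pendingInputs[p] == 0) current.push_back(p);

    std::vector<std::vector<int>> levels;
    while (!current.empty()) {
        std::vector<int> next;
        for (int p : current)
            for (int c : consumers[p])
                // A consumer is ready once its last producer has been placed.
                if (--pendingInputs[c] == 0) next.push_back(c);
        levels.push_back(std::move(current));
        current = std::move(next);
    }
    return levels;
}
```

For a hypothetical frame with 0 = input, 1 = AI, 2 = physics, 3 = animation, 4 = render and edges 0→1, 0→2, 1→3, 2→3, 3→4, this yields the levels {0}, {1, 2}, {3}, {4}: AI and physics can share the cores, and the result is the same every run because the levels depend only on the graph.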

And again, even (or especially) for games, you don't choose an MT approach to make the game perform better. If a game dev thinks "uhm, my performance is too bad, let's switch MT on, I hope it will get better", that's the wrong motivation for MT.

The only reason to launch extra OS threads is because you want to make use of extra CPU cores (or you're forced to by legacy APIs), and the only reason to make use of extra CPU cores is because you need/want more processing power. As above, if you just want simple concurrency -- like background loading, streaming of environments -- you do not need extra threads.

Multi-threading is not something you can 'switch on' later in the project; it has to be designed into the project from the beginning (when using imperative/procedural/OOP languages, anyway). Typical C++ OOP code, when decomposed into an IPO graph, looks like spaghetti code -- every process has too many side effects, and there's too much mutable state, so every process has multiple outputs all over the place. The DAG that's produced is a complex spider-web that ends up as a serial sequence of processes with few opportunities to take advantage of multiple cores. Trying to parallelize that kind of code is a nightmare. If you really want that 300% speed boost that you mentioned (which is attainable in games, despite what many say), you need to be writing code that's well designed for a smart multi-threading strategy from the very start of your project.

If you really want that 300% speed boost that you mentioned (which is attainable in games, despite what many say), you need to be writing code that's well designed for a smart multi-threading strategy from the very start of your project.

Amusingly this code tends to end up looking more functional than anything else; a few years ago I read a book on Haskell and while the syntax hasn't stuck (because I don't use it) the way of writing code did and it made me better at writing threaded code.

The multi-threaded parts of our engine at work are very much functional in that a bunch of state goes in, is used, and a single output is produced in a buffer; this means we can scale up as far as we want or indeed scale down to a single thread for debugging.

(As we have a 'chain' of jobs we do give up some determinism with this system, mostly by allowing different processing segments of the graph to run at different speeds; although these tend to be short chains of work with sync points introduced to ensure logical blocks of work are completed before moving on.)

But to the OP's original post:

1. Multithreading is HARD to use correctly and get any benefit out of, except where a program has multiple, largely independent problems to solve. Some examples where multithreading can be used easily on multicore machines to get extra performance:

You CAN run your networking or resource loading or almost anything IO bound on a different core than your main logic, so your main logic keeps "running" until the IO is finished, and then your background thread notifies your main thread of the completed IO. You CAN'T get much benefit trying to split networking itself up across 4 different cores ... they are all bound by the same resource.

You CAN run 3 different AI algorithms on 3 different cores, IF they read from a read-only set of data, that is small enough that sharing it out to the separate cores is significantly faster than the algorithm itself. You CAN'T get much benefit trying to split 3 AIs to 3 cores if they are trying to WRITE to shared memory.

You CAN have 4 different cores each generate 1/4th of a procedurally generated map and then stitch them together ONLY IF the map generation algorithm doesn't have to know about each decision made in order to make the next one. In general, to do something like this, you must design for it.

You can receive all user input on 1 thread, AI input can be generated on another thread, and these things can be sent to a 3rd thread that does the "work" of processing your game. However, the AI thread isn't really independent of the "game logic" thread, because it must be blocked for the whole phase in which the game logic thread is modifying game state, unless you use a "double game data" technique similar to graphics double buffering, which is almost unheard of. But there are still benefits to the 2 threads even though the AI is a blocked slave half the time. The benefit is ... it can run in parallel with other game logic slaves. So 50% (or any other amount) of the time, only 1 core is doing the heavy lifting ... then the rest of the time, each core is busy doing a separate part of the game that is driven by the game logic (for instance 1 thread drawing, 1 running AI, 1 sending network info, etc).
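For what the "double game data" idea might look like, here is a small sketch. The post only names the technique, so `GameState`, `DoubleBuffered` and `tick` are all invented for illustration, and the example runs on one thread to keep it short. The assumption is that readers (AI, drawing) only ever look at `front()` while game logic only writes `back()`, and that `swap()` is called at a sync point when nobody is mid-frame.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct GameState {  // hypothetical, deliberately tiny
    std::vector<float> positions;
};

// Two copies of the state. Readers see last frame's data via front();
// game logic builds the next frame in back(). Neither side touches the
// other's buffer during a frame, so readers don't have to be blocked.
class DoubleBuffered {
public:
    explicit DoubleBuffered(const GameState& initial)
        : buffers_{initial, initial} {}

    const GameState& front() const { return buffers_[frontIndex_]; }
    GameState& back() { return buffers_[1 - frontIndex_]; }

    // Publish the freshly written frame. Call only at a sync point.
    void swap() { frontIndex_ = 1 - frontIndex_; }

private:
    GameState buffers_[2];
    std::size_t frontIndex_ = 0;
};

// One logic tick: read the previous frame, write the next one, publish it.
void tick(DoubleBuffered& state, float dx) {
    const GameState& prev = state.front();
    GameState& next = state.back();
    assert(&prev != &next);  // reader and writer buffers never alias
    next.positions.resize(prev.positions.size());
    for (std::size_t i = 0; i < prev.positions.size(); ++i)
        next.positions[i] = prev.positions[i] + dx;
    state.swap();
}
```

The price is a second copy of the game data and readers that are always one frame behind, which may be part of why the post calls the technique almost unheard of.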
