~ threads ~

Started by Webstrand
6 comments, last by Matt_D 13 years, 2 months ago
My program must remain compatible with Linux, Windows, and Mac.

I am using these libraries:
OpenGL
SDL
libcurl

I am at the point in my game where I need the map to load simultaneously
while the player continues to move and update.
Right now, when the map scrolls, control shifts to a function that loads the
new part of the map, interprets it, and then creates the vertices to draw,
then returns to the player. That means when the player moves and the map
scrolls, everything stops for about a second, which is way too long. If the
player happens to be between two pieces, then both the piece above and the
piece below keep reloading.

I need a way for both map loading/interpreting and player actions
to happen simultaneously.

My game is now starting to lag a little since I implemented the map
loading and added some basic physics to some objects.

The solution would be threads; I plan on doing it as follows.
Threads:
0 - Leave free for SDL and OpenGL (in case of no GPU)
1 - Load and interpret new map chunks
2 - Calculate basic physics for objects
3 - Player movement and actions

I am using an AMD Phenom X4 2.2 GHz CPU, and with a single-threaded
program that means it can only utilize one of the four cores (2.2 GHz) plus the GPU.

Threads are the solution to my problem correct?

Research I have done shows that threads are different on each operating
system.
How should I implement threads so that the code still compiles on Linux,
Mac, and Windows without modification?

Thanks in advance.
Eliminating the one-second lag when the map moves is something that must be done.
Also, my apologies if some questions seem noobish, because they are, but I like
to double-check everything so that I can get it right.
If this post was helpful please +1 or like it!

Webstrand
Personally I'd combine threads 2 & 3, assuming the player's movement and actions affect / are affected by the physics engine. But there are better people on here who can advise you on threading.

I just wanted to chime in and say you should keep recently loaded chunks of map in memory, so as to handle the situation where a player is bouncing between 2 (or even 4) of them. Keep 9 chunks in memory (the one the player is currently in, and one in each of the 8 directions) and you should be good (or you could do more, but make them smaller). Maybe have a threshold of how far into a map chunk he has to travel before removing the ones behind him and loading the next batch ahead, so that he'd have to make a decent-sized zig-zag before you'd be loading and unloading chunks like crazy.
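
Something like this minimal sketch (ChunkCoord, Chunk, and loadChunk are hypothetical names, and a real version would add the travel threshold described above before re-centering):

#include <map>

struct ChunkCoord
{
    int x, y;
    bool operator<(const ChunkCoord& o) const
    { return x < o.x || (x == o.x && y < o.y); }
};

struct Chunk { /* tiles, vertices, ... */ };

Chunk loadChunk(ChunkCoord c)
{
    Chunk chunk;
    /* expensive: disk read + interpretation */
    return chunk;
}

class ChunkCache
{
    std::map<ChunkCoord, Chunk> chunks; // the 9 resident chunks
    ChunkCoord center;
public:
    // Call whenever the player's chunk coordinate changes.
    void recenter(ChunkCoord playerChunk)
    {
        center = playerChunk;
        std::map<ChunkCoord, Chunk> keep;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
            {
                ChunkCoord c = { center.x + dx, center.y + dy };
                std::map<ChunkCoord, Chunk>::iterator it = chunks.find(c);
                if (it != chunks.end())
                    keep[c] = it->second;   // already resident, reuse it
                else
                    keep[c] = loadChunk(c); // only the new edge loads
            }
        chunks.swap(keep); // chunks outside the 3x3 window are dropped
    }
};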
--- krez (krez_AT_optonline_DOT_net)
Thanks for the info, Krez. I think I should keep more in memory; however, the chunks still take a while to load.

No one here has implemented threads?
If this post was helpful please +1 or like it!

Webstrand

[quote]
The solution would be threads; I plan on doing it as follows.
Threads:
0 - Leave free for SDL and OpenGL (in case of no GPU)
1 - Load and interpret new map chunks
2 - Calculate basic physics for objects
3 - Player movement and actions
[/quote]

I'd cut that down to just 2 threads. Thread 0/2/3 as the first one, and thread 1 as the second. Threading isn't easy, and combining stuff like that will reduce the number of sync points you need.


[quote]
Research I have done shows that threads are different on each operating system.
How should I implement threads so that the code still compiles on Linux, Mac, and Windows without modification?
[/quote]
Try boost::thread.
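
For example, a minimal boost::thread sketch (loadMapChunk is a hypothetical stand-in for your chunk-loading function):

#include <boost/thread.hpp>

void loadMapChunk()
{
    /* load and interpret one map chunk */
}

int main()
{
    // Spawn the loader on its own thread; the game loop keeps running meanwhile.
    boost::thread loader(loadMapChunk);

    /* ... run the game loop here ... */

    loader.join(); // wait for the loader to finish before using its results
    return 0;
}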

[quote]
Research I have done shows that threads are different on each operating system.
How should I implement threads so that the code still compiles on Linux, Mac, and Windows without modification?
[/quote]


SDL has a threading system built in, and it's designed for exactly the sort of thing you're doing. Task parallelism, cross-platform. The Windows implementation of SDL produces Windows threads, and the Linux implementation produces POSIX threads, so you don't have to worry about the details.
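
A minimal sketch with the SDL 1.2 API (loadChunks is a hypothetical entry point; SDL requires the int(*)(void*) shape for thread functions):

#include "SDL.h"
#include "SDL_thread.h"

// Hypothetical thread entry point for loading and interpreting map chunks.
int loadChunks(void* data)
{
    /* load and interpret chunks here */
    return 0;
}

int main(int argc, char* argv[])
{
    SDL_Init(SDL_INIT_VIDEO);

    // Spawn the loader; the main loop keeps moving the player meanwhile.
    SDL_Thread* loader = SDL_CreateThread(loadChunks, NULL);

    /* ... game loop ... */

    int status;
    SDL_WaitThread(loader, &status); // join before touching the loaded data

    SDL_Quit();
    return 0;
}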

However for easier development, I'd recommend OpenMP instead. It's more modern and requires less fuss. http://en.wikipedia.org/wiki/Openmp

If you're using Visual Studio, OpenMP is installed natively with the Pro & Team editions. If you're using Visual C++ Express (2005 or later), you can still get OpenMP for free by installing the Windows 7 SDK, and enabling OpenMP in the solution properties (Configuration Properties, C/C++, Language).

Under Linux, OpenMP is supported by GCC 4.2 & up. For Mac... Dunno. Who cares.


One advantage of OpenMP is that it's mostly done through compiler directives (pragmas), so you don't actually have to change your code very much. You will have to move some things around to prevent synchronisation problems, but you won't have to write any special code for threading, which is the way it should be. Also, if you find yourself porting to a platform without OpenMP, those pragmas will just be ignored, and your project will still compile & run as old-fashioned sequential code. Same if you want a single-threaded debug build: just disable OpenMP for the debug build. No other code-changes required.

But the best thing about OpenMP is just how ridiculously easy it is to take an 'embarrassingly parallel problem' and parallelise it:

#pragma omp parallel for
for (int i = 0; i < 1000000; i++)
{
    do_some_work(i);
}
The approach that's often recommended, and which I happen to subscribe to for the most part, is to phrase your code in terms of tasks rather than threads. These tasks are then submitted to an execution agent (typically a thread pool) in order to be distributed across your computer's cores, whether that be 1, 2, 4 or more. Having three fixed threads, on the other hand, will introduce additional context switching under heavy load on a dual-core machine, but will under-utilize a quad-core machine if your game is sufficiently demanding.

In the task-based approach there is no "physics thread" or "loading thread", and it's easier to write high-level parallel algorithms such as parallel_for_each, parallel_sort and so on, which will, at least in principle, scale to make use of whatever hardware is available.

OpenMP, as suggested by Helicobster, will get you some way towards this goal and in fact might fulfil your needs entirely. But last time I checked, it doesn't explicitly provide a substrate on which to build message-passing kernels, which can also be extremely useful.

Some libraries you might be interested in include Intel's Threading Building Blocks and Microsoft's Parallel Patterns Library.
I second Intel Threading Building Blocks. It is specifically designed to exploit parallelism for performance, and does so at the cost of not supporting asynchronous tasks. That is, you should not use it to run long-running or blocking tasks (as this may cause deadlock, or at least ruin the performance benefits of TBB); for those, you should use full-blown threads (SDL's threads, boost threads, pthreads or whatever).

On the other hand, TBB is suitable (and I highly recommend it) for code that you wish to parallelize for performance. That is, you can use it to update AI, characters or physics, prepare data for rendering, and so on, in a way that makes near-optimal use of available processor cores.

TBB works by splitting your code into tasks, which it then schedules to run in a thread pool in a way that keeps your cores as busy as possible. The thread pool will typically contain one thread per core to minimize context switching. Having more threads than cores will make the program run slower than it should; having fewer threads than cores means you are not utilizing all available processing time. Also, if cores must jump between threads, it is likely that the processor cache won't contain the data you are accessing (because, presumably, other threads will be accessing different memory on a regular basis, causing the previous thread's cached data to be evicted). This will greatly reduce performance, as accessing main memory is slow.

Being aware of the processor cache is more important in multithreaded programs than it is in single-threaded programs. In single-threaded programs, keeping data together in memory so you can access it in turn (locality of reference, e.g. arrays of data) is often enough to have the processor load data into cache before you access it, leading to very fast memory access. Since reading a single byte from main memory actually loads an entire cache line (typically 64 bytes) into the processor cache, accessing this extra data is very cheap, and the processor does this for you in the background.

However, in multithreaded programs, the cache can actually hurt your performance: when you write to a cache line on one core and then read it from another core, the cache line must be synchronized across cores. If this happens often, you may saturate the core interconnect bus, slowing down seemingly unrelated memory access. The problem is that this may happen in non-obvious ways ("false sharing"): if you write to *mem in one thread and read from, say, *(mem + 40) in another, both accesses may still fall in the same cache line and force synchronization of caches. For this reason, you will want to process all data in a single cache line within the same thread, if possible. Intel Threading Building Blocks will usually make sure that threads are fed with data in a way that is cache-friendly, with very little work required from the programmer.
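
A small illustration of padding to avoid this (the 64-byte line size above is typical but processor-dependent):

// Per-thread counters padded so each one owns a whole cache line.
const int CACHE_LINE = 64; // assumed line size

struct PaddedCounter
{
    long value;
    char pad[CACHE_LINE - sizeof(long)]; // keeps the next counter off this line
};

PaddedCounter counters[4]; // one per worker thread

// Without the padding, counters[0] and counters[1] would share a cache line,
// and two threads incrementing them would bounce that line between cores.
void workerIncrement(int threadIndex)
{
    ++counters[threadIndex].value;
}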

Also, if you have many tasks, TBB will share them equally amongst the threads in its pool. If one thread runs out of work before another, it will take some tasks from the busy thread(s) and keep the processor busy (work stealing). Scheduling tasks and stealing work are not simple to implement, so I would encourage you not to try to do it yourself with a standard threading library, but instead use TBB or a similar library which already does this for you.

Besides tasks, TBB provides a number of other (higher-level) abstractions and algorithms to make it easy for you to parallelize your code, including parallel_for (loop over containers or arrays where the size is known ahead of time, in parallel), parallel_while/parallel_do (loop over data where the size is not known ahead of time or until a condition is met), parallel_sort, pipeline (where data is passed from one pipeline "stage" to the next and all stages are executed in parallel, so, e.g., while stage1 processes item A, stage2 may be processing item B and so on) and more. Programming in terms of tasks is much easier (and better at achieving high performance/processor utilization) than raw threading, but programming with these higher-level algorithms is easier still.
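
For example, a parallel_for sketch using the C++0x lambda overload (with older compilers you would pass a functor instead; updateObject is a hypothetical per-object update):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void updateObject(size_t i)
{
    /* physics, AI, etc. for object i */
}

void updateAllObjects(size_t count)
{
    // TBB splits [0, count) into sub-ranges and schedules them as tasks
    // across its worker threads.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, count),
        [](const tbb::blocked_range<size_t>& r)
        {
            for (size_t i = r.begin(); i != r.end(); ++i)
                updateObject(i);
        });
}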

On top of that, TBB also contains a bunch of other useful tools: concurrent containers (concurrent_vector, concurrent_unordered_map (a hash table), concurrent_queue), atomic variables (with atomic operations such as fetch_and_store, compare_and_swap, fetch_and_add, etc.), a variety of locks, a multithreading-safe high-performance timer, scalable versions of malloc and free (and scalable allocators for use with STL containers) and a few other useful bits and pieces.
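
As a small example, concurrent_queue lets worker tasks hand results to another thread without any explicit locking:

#include <tbb/concurrent_queue.h>

tbb::concurrent_queue<int> results; // safe to push/pop from many threads at once

void worker()
{
    results.push(42); // e.g. called from many tasks in parallel
}

void consumer()
{
    int value;
    while (results.try_pop(value)) // non-blocking; returns false when empty
    {
        /* handle value */
    }
}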

For asynchronous tasks, tasks that will run for a long time, or tasks that will block, you will want to use full-blown threads instead of tasks. This is because if a task in TBB blocks, that thread (and, because there will typically be one thread per core, that core) will not be able to run other tasks (and if the task is blocked, rather than long-running, that core will be idle). TBB is designed to play nice with other threading packages, so you can go ahead and use whatever you like (e.g., I use SDL's threads) for those tasks.

In my engine, I have three threads:
- The main thread is dedicated to gathering input and rendering (because SDL's input system and OpenGL are required to be run from the main thread)
- The first SDL thread is used to run game logic
- The second SDL thread is used to asynchronously load resources from disk

The main thread and the resource thread basically consist of sequential code: no additional parallelism beyond having their own dedicated threads. The logic thread, however, uses TBB (especially parallel_for) to process game objects, events and so on in parallel (in an entity system, but it will work just as well for other kinds of game engines).

My engine does not yet handle audio, networking or a third-party physics engine. Adding these may cause more dedicated threads to be added (either by me or by the libraries). Ideally, you want to make sure that there aren't too many more threads than cores (you can tell TBB to use fewer threads, for example), although since some threads will block (e.g. to synchronize on mutexes), it is often desirable to have _slightly_ more threads than cores. Still, until you master multicore programming, I wouldn't worry about this; just let TBB do as it pleases.

The great thing about task-based parallelism, rather than thread-based, is that your program will use enough threads to make use of however many cores future processors have, while if you have only a fixed number of dedicated threads, you will not gain anything from having more cores than the program was designed for.

Note: you do not have to use TBB for any of this; there are other, similar libraries that you could choose instead, but since I use TBB, you will have to google for them yourself. I would definitely discourage you from trying to write TBB-style code manually using more traditional threading packages though, as this is a recipe for horribly slow code.
I would avoid heterogeneous thread layouts (that is, running a lot of different threads at the same time).

What you want to do is break your game loop into "steps" and "jobs". At each "step" and "job" you want to go as wide as possible, using your threads as a constantly running thread pool.

So your game loop ends up like:
pre-physics gameplay - pre-physics AI - physics - post gameplay - post AI - generate render lists - async IO.

Within each "step" there may be multiple "jobs" which get run as "wide" as possible. So your physics step may have a raycast resolve, an update, and a collision resolve.
Not all jobs will be able to go wide (e.g. UI), but some jobs will be easier to multiprocess (async IO/physics).
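
As a sketch of one step running its jobs wide, using TBB's parallel_invoke and assuming the three physics jobs are independent (if they are sequential sub-phases, each one would instead go wide internally):

#include <tbb/parallel_invoke.h>

// Hypothetical physics jobs for this step.
void resolveRaycasts()   { /* ... */ }
void updateBodies()      { /* ... */ }
void resolveCollisions() { /* ... */ }

void physicsStep()
{
    // All three jobs run wide across the pool; parallel_invoke returns
    // only when every job has finished, which gives you a natural sync
    // point at the step boundary.
    tbb::parallel_invoke(resolveRaycasts, updateBodies, resolveCollisions);
}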

It's pretty typical to run a separate render thread regardless, which is fed from the render-list point in your game loop.

You should also aim to keep each "step" as dependency-free as possible, and only communicate between steps using events. In each job, use lock-free queues to write results from the workers.

Also use "copies" of data for worker threads, and only commit all the updates when the job has finished.

This ensures that you have consistent state during processing, and gives you consistent synchronisation points during execution.
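
A minimal double-buffering sketch of that idea (names are illustrative, and it assumes the flip happens on one thread after all of the step's jobs have joined):

struct GameState { /* positions, velocities, ... */ };

GameState stateBuffers[2];
int readIndex = 0; // every job reads this buffer during the step

void runStep()
{
    const GameState& current = stateBuffers[readIndex]; // consistent snapshot
    GameState& next = stateBuffers[1 - readIndex];      // workers write here

    /* run the step's jobs wide, reading 'current' and writing 'next' */

    // Single sync point: flip the buffers once every job has finished.
    readIndex = 1 - readIndex;
}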

Doing things this way, with a mostly homogeneous thread layout, wins you massive instruction cache coherence, good predictability in terms of deadlock/livelock problems, and very little actual mutex usage.
your never as good as they say you were, never as bad as they say you was.

This topic is closed to new replies.
