Multithreading a game's logic

Started by
25 comments, last by _the_phantom_ 10 years, 7 months ago

So I really want to get my game to a state where it can fully take advantage of multiple threads, and make use of the extra power, e.g. to improve view distance.

Getting things like, say, audio to largely use another thread is fairly straightforward, as is the slower part of auto saves (compression and writing to disk) or loading in data (e.g. additional chunks in an open world).

For rendering I went with the concept of each object having 3 blocks/copies of all the state data required to render it, so that the render thread can read one, the update thread can write another, and at the end of an update the update thread takes a different one and leaves the newly updated copy for the render thread to pick up next. Interpolation is then used to greatly smooth the rendering out (basically everything is currently only simulated at 20Hz).


struct RenderStateForSomeObjectType
{
    Vector3I prevPos, pos;
    float prevPitch, pitch;
    float prevYaw, yaw;
    ...
};
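
The hand-off between threads is roughly this (a sketch; the names are made up, and the real thing also has to handle objects being created and removed):

#include <atomic>

struct TripleBufferedState
{
    static const int FRESH = 4;      // flag bit above the 2-bit index

    RenderStateForSomeObjectType state[3];
    std::atomic<int> published{0};   // index of the last completed copy
    int writeIndex = 1;              // owned by the update thread
    int readIndex  = 2;              // owned by the render thread

    // Update thread, at the end of a logic step: publish the copy just
    // written and take whichever copy is free for the next step.
    void publish()
    {
        writeIndex = published.exchange(writeIndex | FRESH) & 3;
    }

    // Render thread: swap only if something new was published, otherwise
    // keep rendering (and interpolating) the copy it already has.
    const RenderStateForSomeObjectType &read()
    {
        if (published.load() & FRESH)
            readIndex = published.exchange(readIndex) & 3;
        return state[readIndex];
    }
};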

However, what I really want is to get the heavy lifting in the logic loop to make the best use of the additional threads (since at present everything else still averages out to around a single core). Actually getting logic to run on multiple threads seems like an incredibly difficult thing to do.

So my idea is to run the logic as essentially 1 thread, but at each stage of the logic step, if the task can be completed with multiple threads (without a tonne of locks), give it out to the worker threads and wait for it to complete before doing the next thing.

[attachment=17524:threads.png]
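
In code the pattern would look something like this (a sketch using std::async for brevity; Chunk and its update() are stand-ins, and a real version would hand the ranges to persistent worker threads rather than spawning new ones):

#include <algorithm>
#include <future>
#include <vector>

void updateChunksParallel(std::vector<Chunk*> &chunks, unsigned workerCount)
{
    std::vector<std::future<void>> jobs;
    const size_t per = (chunks.size() + workerCount - 1) / workerCount;
    for (unsigned w = 0; w < workerCount; ++w)
    {
        const size_t begin = w * per;
        const size_t end   = std::min(begin + per, chunks.size());
        if (begin >= end)
            break;
        jobs.push_back(std::async(std::launch::async, [&chunks, begin, end]
        {
            for (size_t i = begin; i < end; ++i)
                chunks[i]->update();   // must not touch another chunk's state
        }));
    }
    for (auto &j : jobs)
        j.wait();                      // join before the next logic stage
}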

Is that a good plan, or is there some easy way to actually run all the logic on 3 or 4 threads all the time, or a better way to split the work up?


Tread lightly.

Multithreading an application only makes sense if work can be done independently. Even if you chain everything to just a single lock, you're still basically seeing single threaded performance. For example, a self-contained routine that translates a document into another language can be run on multiple threads to convert a bundle of documents quickly. But a routine with a dependency that sits behind a lock will only ever see single threaded performance.
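
For instance, a sketch of that bundled-documents case (translate() stands in for whatever self-contained routine is being parallelised):

#include <future>
#include <string>
#include <vector>

std::vector<std::string> translateAll(const std::vector<std::string> &docs)
{
    std::vector<std::future<std::string>> tasks;
    for (const auto &doc : docs)
        tasks.push_back(std::async(std::launch::async,
                                   [&doc] { return translate(doc); }));

    std::vector<std::string> results;
    for (auto &t : tasks)
        results.push_back(t.get());  // no locks: inputs are read-only and
                                     // every output is independent
    return results;
}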

So, you need to fundamentally split your engine/framework/etc (whatever you want to make use of multiple threads) into partitions of work that can be done independently. It should be done in such a way that, after you give it its initialization parameters, it's basically a self-contained thing. This could be devoting a thread to UI. There are also techniques to break physics simulations down into discrete portions of a scene. These can be run in parallel.

That's not to call locks bad. They have their uses. But if you're relying on locks alone, you're setting yourself up for failure. Given this, trying to Multithread Everything (TM) is a bad approach, because not everything makes sense to multithread. Also, when you get into multithreaded applications, fixing bugs can be very tricky, since you have zero control over precisely when something executes. Be aware you're probably going to run into some seriously funky situations. Just be patient with those.

This seems like an example of premature optimization. If you want to dive into threads because threading sounds interesting to you then go right ahead. Trying to thread a game would be a good learning experience and a solid understanding of threading will help you in many programming fields. If you want to thread simply because you want to create a game engine that will have peak performance, then I would drop the idea of threading game logic.

When you want to make a game perform better, the first step is to identify the bottleneck: the point in the code that is using up the most time and keeping you from increasing the detail, complexity, or intelligence of your game.

If your goal is to improve drawing performance, increase draw distance, or add more detail, the GPU or communication with the GPU is going to be that bottleneck. To increase graphics performance you usually do things like move data to the GPU at load time, then make draw calls that use the data already on the GPU. Decreasing the number of draw calls by batching can also improve performance when there are a lot of objects on screen, and level of detail lets the GPU quickly render objects that take up a small portion of the screen while improving detail as things get closer.

Tricks like that leverage the GPU, which is already 'threaded': top-of-the-line GPUs these days have thousands of cores that run in parallel processing vertices and pixels, and the GPU instruction set is built for 3D graphics, so it can do the required math much faster than the CPU can. Trying to use CPU threads to help out the GPU will most likely only slow the graphics down: doing a calculation on the CPU that could be done faster on the GPU means the result has to be moved to the GPU, and that is a relatively slow process. If the CPU is involved with graphics at all, it is best used for some sort of preprocessing that prepares the data once, when the game is loaded, to be sent to the GPU and stored there.

As for game logic: most games don't tax the CPU with game logic. My suggestion to you: make a game, then determine whether you need to optimize. If you do, identify the bottleneck and find a better algorithm, or simplify the task being done, to speed it up. Improving algorithms can increase performance by orders of magnitude, but threading can only increase performance linearly with the number of threads, and you can only hope to make gains by adding threads if a large portion of your code can run concurrently. If half your code is concurrent, then at best you could cut the processing time in half, but you couldn't improve it beyond that.
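
That limit is Amdahl's law: if a fraction p of the work can be parallelised across n threads, then

speedup(n) = 1 / ((1 - p) + p/n)

and as n grows the speedup approaches 1/(1 - p), which for p = 0.5 is exactly the factor of 2 above.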

I think a good use for threads in games is loading content in real time, calculating a move in chess, or another task that takes a long time; trying to thread the main game loop won't likely yield high returns.

My current game project Platform RPG

Rule 1: Only use threads on tasks that really can run concurrently.

Game logic is not likely somewhere you want to use threads: the problem with game logic is that it often depends on checking the state of other game objects, and obviously if those are getting changed in an unknown order you're going to have undefined behavior. Sound, networking, and I/O are better suited.

Rule 2: Beware premature optimization. The complexity and coding time you're spending will probably net a fractional gain compared to simply making some of the logic code more performant. Optimize when you find something is actually causing a problem.

Prefer better code to throwing threads at a problem like hired muscle. If one person can move a couch in 10 seconds, don't spend 15 seconds rounding up 5 people to move the same couch in 2 seconds while they rocket it into your doorway and crack the frame.

Well, the fact that the game logic runs at only 20Hz on a 2.8GHz core and still can't keep up means something needs to shift, and I would really like to find solutions other than reducing the view distance.

e.g. the following bit of code is from the "Rebuild the CPU side vertex and index buffers" task (once I have those buffers I can render at about 250fps on my system, but I need to build the actual data first).


bool SimpleBlockRenderer::notSolid(WorldChunk &chunk, int x, int y, int z)
{
    //Gets a block id from the chunk if possible (coordinates in range),
    //else falls back to World::getBlockId, which looks up the chunk
    //via a hashmap
    BlockId id = chunk.getWorldBlockId(x, y, z);
    const Block *block = getBlock(id);
    return !block->opaqueCube;
}
void SimpleBlockRenderer::cache(WorldChunk &chunk, ChunkRenderer &chunkRenderer, const Block *block, BlockId bid, int x, int y, int z)
{
    //Caches each of the 6 faces of a standard cube.
    //At present this covers the majority of the world.
    //There is the possibility of smoothing out things like grass, which
    //would obviously be more demanding than this.

    //The cache functions here are just putting stuff into a vertex or index
    //buffer. A small mutex protected section then hands the updated buffer to
    //the render thread, which makes or updates the ID3D11Buffer objects.

    //east west
    if (notSolid(chunk, x+1,y,z))
        cacheEast(chunk, chunkRenderer, block, bid, x, y, z);
    if (notSolid(chunk, x-1,y,z))
        cacheWest(chunk, chunkRenderer, block, bid, x, y, z);
    //top bottom
    if (notSolid(chunk, x,y+1,z))
        cacheTop(chunk, chunkRenderer, block, bid, x, y, z);
    if (notSolid(chunk, x,y-1,z))
        cacheBottom(chunk, chunkRenderer, block, bid, x, y, z);
    //north south
    if (notSolid(chunk, x,y,z+1))
        cacheNorth(chunk, chunkRenderer, block, bid, x, y, z);
    if (notSolid(chunk, x,y,z-1))
        cacheSouth(chunk, chunkRenderer, block, bid, x, y, z);
}

Running on a 2.8GHz hex core, so roughly 16.7% of CPU time represents one fully loaded core.
Stack, % of CPU time
|- SimpleBlockRenderer::cache                                   7.86
|     |- SimpleBlockRenderer::notSolid                          5.73
|     |     |- SimpleBlockRenderer::notSolid<itself>            3.86
|     |     |- WorldChunk::getBlockId                           1.28
|     |     |- World::getBlockId                                0.60
|     |     |- ntkrnlmp.exe!KiInterruptDispatchNoLock           0.00
|     |     |- ntkrnlmp.exe!KiDpcInterrupt                      0.00
|     |- SimpleBlockRenderer::cache<itself>                     1.02
|     |- SimpleBlockRenderer::cacheTop                          0.33
|     |- SimpleBlockRenderer::cacheEast                         0.17
|     |- SimpleBlockRenderer::cacheWest                         0.16
|     |- SimpleBlockRenderer::cacheNorth                        0.16
|     |- SimpleBlockRenderer::cacheBottom                       0.15
|     |- SimpleBlockRenderer::cacheSouth                        0.14
|     |- ntkrnlmp.exe!KiInterruptDispatchNoLock                 0.00

To run that over a 32x32x300 (approx height) region takes about 20ms with that code. With all the optimisations I could think of it's about 16ms, so still not workable if I want to do multiple chunks per frame, and I do not think duplicating the entire world state (perhaps even onto the GPU) is practical: e.g. a 20x20 loaded region is 400 chunks, which is about 500MB of data (for the ids and lightmap).

Doing multiple chunks at once has no dependencies between the chunks, but the world data cannot be changed in the meantime. Given that, say, 3 chunks on average need to be recreated each frame, 16ms vs 48ms is why I want to explore the best threading options.

A similar thing stands for general updates: if I make a rule that nothing may access or modify anything more than 32 units away, then each update of a 32x32 region needs no locks, provided there is a 64 unit border between regions being updated.
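
One way to schedule that without locks (a sketch; parallel_for_each_region, updateRegion and waitForWorkers are hypothetical helpers) is to sweep the region grid in nine passes, so that regions updated in the same pass always have two untouched 32-unit regions, i.e. a 64 unit border, between them:

for (int px = 0; px < 3; ++px)
    for (int pz = 0; pz < 3; ++pz)
    {
        // Same-pass regions are 3 grid cells (96 units) apart on each
        // axis, so their 32-unit reach zones can never overlap.
        parallel_for_each_region([&](int rx, int rz)
        {
            if (rx % 3 == px && rz % 3 == pz)
                updateRegion(rx, rz);
        });
        waitForWorkers();   // join before the next pass
    }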

@HappyCoder

The above is one such bottleneck. As well as CPU sample profiles like the one above, I have this timing breakdown, done quickly with QueryPerformanceCounter (to identify issues with some specific update steps that were over 200ms, which my interpolated rendering didn't really like). So apart from optimising the last few ms out of some things, or reducing the data set, threads seem a good idea.


Logic update took to long. time=60.01636 target=50
//This needs rewriting anyway, since it was not updated to correctly
//handle chunk borders. However I suspect this task will always
//be fairly expensive
Lighting:          11.92671
//Checks which currently loaded chunks are needed, gets rid of unneeded ones
//and loads/creates new ones
//also checks if any background loaders have completed
Chunk Loading:     6.72902
    //Loads or creates new chunks. If the chunk already exists it will be
    //loaded with a background thread (decompressing the files takes a fair
    //bit of CPU).
    //Generation of new chunks is in-thread, limited to 1 per update step
    Ensure Loaded: 4.63302
    //Deletes unneeded chunks from memory
    Unload:        0.03959
Saving:      3.19523
    //This could be improved, since it does some disk access on the logic thread
    World          0.03459
    //Same
    Player         0.02342
    //The logic thread just creates a std::vector<uint8_t> for these, and gives
    //the vector to another thread that compresses it with zlib and writes it
    Chunks         2.93431

    //Updates all entities, scheduled block updates and random block updates
    //for all active chunks
Chunk Update:      6.42098
    //Creates NPCs, plants trees, etc. Didn't run on this frame
    Populate:      0
//Logic simply to provide the renderer with data
//e.g. new vertex buffer contents
//Entities are not included here since I use a triple-buffer lockless
//scheme at the end of the entities' own updates (so included above)
//At the cost of artifacts, restricted to 2 chunks per update step
Render Update:     31.74442
    Chunk Cache:   31.45341

Given those numbers, I'd say you won't gain much by threading. You need to figure out why the cache and solid checks take up so much time before bothering with multiple threads. I'd look at two specific things: memory access/data locality, and possibly buffer locking. Starting with the buffer locking: you may want to double-buffer the vertex buffers to prevent any blocking there. Additionally, if you don't already, lock the vertex buffer only once when starting the update and unlock it when completely updated, as each lock is effectively a mutex, which can be expensive.
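
For the single-lock point, a quick D3D11 sketch (since ID3D11Buffer came up earlier; cpuVertices and Vertex are placeholders): map a dynamic buffer once per rebuild, copy everything, then unmap.

// One Map/Unmap per rebuild instead of per block or per face. Requires
// the buffer to be created with D3D11_USAGE_DYNAMIC and
// D3D11_CPU_ACCESS_WRITE.
D3D11_MAPPED_SUBRESOURCE mapped;
if (SUCCEEDED(context->Map(vertexBuffer, 0,
                           D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
{
    memcpy(mapped.pData, cpuVertices.data(),
           cpuVertices.size() * sizeof(Vertex));
    context->Unmap(vertexBuffer, 0);
}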

As to the memory portion, there are several options. You might want to look at a swizzle of the memory so your blocks are more localized. Another option is to process in memory order instead of jumping around so much, i.e. do x-1, x, x+1 on a single y/z cross-section; if x is the primary memory order, you pay only the initial cache miss on x-1 and the other items will likely be prefetched automatically. You might even look into issuing prefetch instructions yourself to prepare each of the lines. This is probably the more complicated and difficult item to correct, and there are a lot of possible issues to look at; these are just two common ones.
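
As a sketch of the access-order point (this assumes index = x + y*SX + z*SX*SY, i.e. x fastest; verify against your own layout first; SX/SY/SZ and processBlock() are placeholders):

for (int z = 0; z < SZ; ++z)
    for (int y = 0; y < SY; ++y)
        for (int x = 0; x < SX; ++x)
        {
            const size_t i = (size_t)x + (size_t)y * SX + (size_t)z * SX * SY;
            // The x-1/x+1 neighbours are adjacent in memory, already in
            // the cache line being walked; only the y/z neighbours stride.
            processBlock(blocks, i, x, y, z);
        }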

These are just initial ideas given the information you've presented. There are other ways to look at this, such as issuing a single lighting pass over all blocks at once instead of going block by block (again, potentially making it so you can iterate memory in a single linear stream), etc.

Well, like I said, there are no locks in that code (at least not in the measured code as-is; the small buffer hand-off wasn't timed). The vertex buffers are accessed in those cache methods anyway, which could perhaps be faster.

The remote accesses in that are:


//I guess in <itself> the getBlock(id) which is a lookup with an array of pointers
//I did try to make a dense "solidBlocks[BLOCK_TYPE_COUNT]" boolean array, but did not see a noticeable improvement
|     |     |- SimpleBlockRenderer::notSolid<itself>            3.86
//Gets the block id from a dense array
|     |     |- WorldChunk::getBlockId                           1.28
//Uses an unordered_map to get the WorldChunk, then calls the method above. Used for the edges of chunks
|     |     |- World::getBlockId                                0.60
//no idea what this is, doesn't have a significant number of samples
|     |     |- ntkrnlmp.exe!KiInterruptDispatchNoLock           0.00

//As for cache itself, it's just those function calls; how can I possibly reduce that 1%? Must be function call overhead?

I think the issue is that each chunk has 32x32x300 blocks (307,200) and does 6 look-ups each (1,843,200). Is that ever going to be possible in the <5ms range?

I could look at the access order, I guess. But wouldn't accesses on 1 or 2 of the axes always break the cache, whichever order I pick?

I guess I could group the blocks by render type, but surely the staging would cost more than the few ms saved?

Rule 1: Only use threads on tasks that really can run concurrently.

Game logic is not likely somewhere you want to use threads: the problem with game logic is that it often depends on checking the state of other game objects, and obviously if those are getting changed in an unknown order you're going to have undefined behavior. Sound, networking, and I/O are better suited.

I just wanted to reply to this little bit. Game logic is perfectly valid for multicore execution, you just need to do it correctly. I use the single access rule in my game objects so I can multicore the hell out of them and it works like a charm with very low overhead. Just for example's sake, say you have the following:

void Update( float deltaT )
{
  // Say I'm following something.
  Vector3f  target = mFollowTarget.mPosition;
 
  // Do movement code.
  ......
 
  // Apply movement.
  mPosition += deltaT * mVelocity;
}

This will obviously not thread correctly since it breaks the single access rule. How? It breaks the rule by reading mPosition (on a different object, which may be mid-update) and then modifying mPosition on this object. Obviously you either have to wrap the access with a mutex or restructure the code so it doesn't break the rule; in my threading system I use update stages, as follows:

void IssueUpdate()
{
    // Yup, a singleton, there is only one of these in a process at a time, end of subject, do not
    // pass go, do not make more than one... :)
    Threading::Team::Instance()(
        [=]( float deltaT )
        {
            // Say I'm following something.
            Vector3f target = mFollowTarget.mPosition;

            // Do movement code.
            ......
        } );
    Threading::Team::Instance().Barrier();  // Make sure all the above calls are processed before moving on.
    Threading::Team::Instance()(
        [=]( float deltaT )
        {
            mPosition = mPosition + mVelocity * deltaT;
        } );
}

// Somewhere else.
Threading::Team::Instance().Execute();

Obviously this is not exactly how you would want to do it: you really want all objects to issue the first part, then insert a single barrier, and then have all objects run the second part. But in general you can distribute this over all cores and get nearly linear performance gains. You end up with one mutex per thread between the two stages (the barrier), but each object is updated without any locking requirements, so you 'only' have that single synchronization per update.

Sorry, don't mean to pick on you but saying game logic is a bad choice for threading is painting with an awfully large brush.. :)

I think the issue is that each chunk has 32x32x300 blocks (307,200) and does 6 look-ups each (1,843,200). Is that ever going to be possible in the <5ms range?

I could look at the access order, I guess. But wouldn't accesses on 1 or 2 of the axes always break the cache, whichever order I pick?

You should very easily be able to hit your goal, but it requires a bit of a rethink of how you process things (and this assumes it is a memory issue; hard to be absolutely sure, of course). Say you want to reduce the randomness of access: the first thing is to validate the order of your memory. Assuming &data[x=1] - &data[x=0] equals your block storage size, everything is in primary x-axis order in memory. Then, assuming &data[y=1] - &data[y=0] equals 32 blocks' worth of data, y is next in memory, and the linear index is x + y*xStride + z*xStride*yStride, so each step in z is the largest jump in memory.

So, you know that moving in z is the most expensive, which makes the z-1/z+1 lookups your most expensive calls in terms of memory. Each block is interested in 3 z indices, which suggests a fairly elegant solution to minimize the random access. Make 3 bit arrays of 32x32 each, and fill them in with the IsSolid results for the z=0, z=1, z=2 slices of memory. Once you have 3 layers of precached data, you can quickly look up the results in the bit arrays and no longer thrash about in memory so much. To move to the next layer, you simply overwrite one of the 3 layers of bit data (i.e. triple buffering) with the next layer of data and keep going until you hit the last layer.
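
A rough sketch of what that might look like (isSolidAt() is a stand-in for your lookup, and a byte per cell is used here for clarity; a real bit array would be even smaller):

struct SliceCache
{
    bool layer[3][32 * 32];   // solidity for three adjacent z slices

    // One linear pass per slice: each block is read exactly once,
    // in memory order.
    void fill(int which, WorldChunk &chunk, int z)
    {
        for (int y = 0; y < 32; ++y)
            for (int x = 0; x < 32; ++x)
                layer[which][x + y * 32] = isSolidAt(chunk, x, y, z);
    }

    bool solid(int which, int x, int y) const
    {
        return layer[which][x + y * 32];
    }
};

// Per chunk: fill slices z-1, z, z+1, emit the faces for slice z using
// only the cached bits, then overwrite the oldest slice with z+2 and
// repeat until the last layer.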

This is not the "best" solution I can think of, but it was an easy one to describe and probably fairly simple to implement as a first test. Accessing the individual blocks is now done linearly and only once per block update instead of 6+ times, and you've cached the results of the test in a much smaller piece of memory, which itself likely lives in cache the entire time.

It is a different way to look at the memory access and should provide significant speedup.

I just wanted to reply to this little bit. Game logic is perfectly valid for multicore execution, you just need to do it correctly. I use the single access rule in my game objects so I can multicore the hell out of them and it works like a charm with very low overhead.


Indeed.

The other option is to maintain 'shadow state', whereby objects keep a 'public' copy of their data (likely a subset of internal state); anyone is free to read from it, and the objects only update their internal state. At a predefined point in time, all objects copy 'live' to 'shadow'.

Yes, all your reads are one frame behind, but for most things this is unlikely to make a huge difference, and it removes any and all locks from the system. This is probably best expressed as 'tasks'.
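
A minimal sketch of the idea (the layout is illustrative, not from any particular engine):

struct GameObject
{
    struct PublicState
    {
        Vector3f position;
        Vector3f velocity;
    };

    PublicState shadow;   // the 'public' copy: anyone may read it, one frame old
    PublicState live;     // written only by this object's own update

    void update(float deltaT, const GameObject &followTarget)
    {
        // Reads of other objects only ever touch their shadow copy...
        Vector3f target = followTarget.shadow.position;

        // ...while writes only ever touch our own live state.
        live.velocity = (target - live.position) * 0.5f;
        live.position = live.position + live.velocity * deltaT;
    }

    void sync() { shadow = live; }   // done for every object at the single
                                     // predefined point in the frame
};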

Rendering is also a good candidate for threading as there are many stages which can be parallelised before final command submission.

For example our renderer is currently configured to have 8 stages before the final output is generated; the command list for each of these stages can be generated in parallel before final submission to the GPU.

In our case the render pipe before command list generation is also multi-stage, with each stage threaded internally via a task system as well. All in all we have 6 threads which pick up work, so each stage can go wide: [stage 1]-[stage 2]-[stage 3]-[stage 4]-[scene command list generation], and only the final GPU submission is single threaded.

This topic is closed to new replies.
