
# Multithreading a games logic


26 replies to this topic

### #1SyncViews  Members   -  Reputation: 465


Posted 21 August 2013 - 10:53 AM

So I really want to get my game to a state where it can fully take advantage of multiple threads, and make use of the extra power, e.g. to improve view distance.

Getting things like audio to largely use another thread is fairly straightforward, as are the slower parts of auto-saves (compression and writing to disk) or loading in data (e.g. additional chunks in an open world).

For rendering I went with the concept of each object having 3 blocks/copies of all the state data required to render it, so that the render thread can read one, the update thread can write one, and at the end of the update the update thread takes a different one and leaves the newly updated one to be rendered next. Interpolation is then used to greatly smooth out the rendering (basically everything is currently only simulated at 20Hz).

```cpp
struct RenderStateForSomeObjectType
{
    Vector3I prevPos, pos;
    float prevPitch, pitch;
    float prevYaw, yaw;
    ...
};
```
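The three-copy hand-off described above can be sketched as a lockless triple buffer. This is a minimal sketch, not the poster's actual code: all names are hypothetical, and it assumes exactly one writer (update) thread and one reader (render) thread.

```cpp
#include <atomic>
#include <cstdint>

// One slot is being written, one is being read, and one is "pending": the
// latest complete state. A single atomic packs the pending slot index (low
// bits) plus a "fresh" flag, so publish/consume are single exchanges.
template <typename State>
class TripleBuffer
{
public:
    State &writeSlot() { return mSlots[mWrite]; }

    // Update thread: publish the slot just written, recycle the old pending one.
    void publish()
    {
        uint32_t prev = mPending.exchange(mWrite | FRESH, std::memory_order_acq_rel);
        mWrite = prev & INDEX;
    }

    // Render thread: switch to the newest published slot, if there is one.
    const State &readSlot()
    {
        if (mPending.load(std::memory_order_acquire) & FRESH)
        {
            uint32_t prev = mPending.exchange(mRead, std::memory_order_acq_rel);
            mRead = prev & INDEX;
        }
        return mSlots[mRead];
    }

private:
    static const uint32_t FRESH = 4, INDEX = 3;
    State mSlots[3];
    uint32_t mWrite = 0, mRead = 1;
    std::atomic<uint32_t> mPending{2};
};
```

The write, read, and pending indices are always a permutation of {0, 1, 2}, so neither thread ever blocks the other; the acquire/release exchanges make the published state visible to the reader.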


However, what I really want to do is get the heavy lifting in the logic loop to make the best use of the additional threads (since at present everything else combined still only averages around a single core). Actually getting logic to run on multiple threads seems like an incredibly difficult thing to do.

So my idea is to run the logic as essentially one thread, but at each stage of the logic step, if the task can be completed with multiple threads (without a tonne of locks), hand it out to the worker threads and wait for it to complete before doing the next thing.

Is that a good plan, or is there some easy way to actually run all the logic on 3 or 4 threads all the time, or a better way to split the work up?


### #2hrmmm  Members   -  Reputation: 110


Posted 21 August 2013 - 11:26 AM

Tread lightly.

Multithreading an application only makes sense if work can be done independently. Even if you chain everything to just a single lock, you're still basically seeing single-threaded performance. For example, a self-contained routine to translate a document to another language can be split across multiple threads to convert a bundle of documents quickly. But a routine with a dependency that involves a lock will only see single-threaded performance.

So, you need to fundamentally split your engine/framework/etc (whatever you want to make use of multiple threads) into partitions of work that can be done independently. It should be done in such a way that after you give it its initialization parameters it's basically a self-contained thing. This could be devoting a thread to UI. There are also techniques to break physics simulations down into discrete portions of a scene. These can be run in parallel.

That's not to call locks bad; they have their uses. But if you're relying on locks alone then you're setting yourself up for failure. Given this, trying to Multithread Everything™ is a bad approach, because not everything makes sense to multithread. Also, when you get into multithreaded applications, fixing bugs can be very tricky since you have zero control over precisely when something is executed. Be aware you're probably going to run into some seriously funky situations. Just be patient with those.

### #3HappyCoder  Members   -  Reputation: 2843


Posted 21 August 2013 - 12:21 PM

This seems like an example of premature optimization. If you want to dive into threads because threading sounds interesting to you then go right ahead. Trying to thread a game would be a good learning experience and a solid understanding of threading will help you in many programming fields. If you want to thread simply because you want to create a game engine that will have peak performance, then I would drop the idea of threading game logic.

When you want to make a game perform better, the first step is to identify the bottleneck: the point in the code that is using up the most time and keeping you from increasing the detail, complexity, or intelligence of your game. If your goal is to improve drawing performance, increase drawing distance, or add more detail, the GPU or communication with the GPU is going to be the bottleneck. To increase graphics performance you usually do things like move data to the GPU on load, then make draw calls that use the data already on the GPU. Decreasing the number of draw calls by batching can also improve performance when there are a lot of objects on screen, and level of detail lets the GPU quickly render objects that take up a small portion of the screen while improving detail as things get closer.

Doing tricks like that to leverage the GPU will give much better graphics performance; after all, a GPU is already 'threaded'. Top-of-the-line GPUs nowadays have thousands of cores that run in parallel processing vertices and pixels, and the GPU instruction set is built for 3D graphics, so it can do the math required for 3D graphics much faster than the CPU can. Trying to use CPU threads to help out the GPU will most likely only slow down the graphics process: doing a calculation on the CPU that could be done faster on the GPU means the result needs to be moved to the GPU, which is a relatively slow process. If the CPU is involved with graphics, it is best used for some sort of preprocessing algorithm that prepares the data once, when the game is loaded, to be sent to the GPU and stored there.

As for game logic: most games don't tax the CPU with it. My suggestion to you: make a game, then determine if you need to optimize. If you do, identify the bottleneck and find a better algorithm, or simplify the task being done to speed it up. Improving algorithms can increase performance by orders of magnitude, but threading can only increase performance linearly as the number of threads increases, and you can only hope to make gains by adding threads if a large portion of your code can run concurrently. If half your code is concurrent then at best you could cut the processing time in half, but you couldn't improve it beyond that.

I think a good use for threads in games is loading content in real time, calculating a move in chess, or another task that takes a long time, but trying to thread the main game loop won't likely yield high returns.
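The "half concurrent means at best 2x" ceiling above is Amdahl's law. A quick illustration (not from the thread, just the standard formula):

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: overall speedup from n threads when only fraction p of the
// work can run in parallel. The serial fraction (1 - p) never shrinks, which
// is why algorithmic improvements usually beat adding threads.
double amdahlSpeedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 0.5, four threads give only about a 1.6x speedup, and even an unlimited number of threads cannot pass 2x; a better algorithm has no such ceiling.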

### #4Satharis  Members   -  Reputation: 1060


Posted 21 August 2013 - 01:29 PM

Rule 1: Only use threads on tasks that really can run concurrently.

Game logic is not likely somewhere you want to use threads. The problem with game logic is that it often depends on checking the state of other game objects, and obviously if those are getting changed in an unknown order you're going to have undefined behavior. Sound, networking, I/O: those are things better suited.

Rule 2: Beware premature optimization. The time you're spending adding complexity and coding time to your project will probably net a fractional gain compared to changing some code inside the logic to be more performant. Optimize when you find something is actually causing a problem.

Prefer better code to throwing threads at a problem like some kind of muscle club. If you can move a couch with one person over 10 seconds don't spend 15 seconds getting 5 people to move the same couch in 2 seconds while they all rocket it into your doorway and crack the frame.

### #5SyncViews  Members   -  Reputation: 465


Posted 21 August 2013 - 01:52 PM

Well, the fact that the game logic runs at 20Hz on a 2.8GHz core and still fails means something needs to shift, and I would really like to find solutions other than reducing the view distance.

e.g. the following bit of code is from the "rebuild the CPU-side vertex and index buffers" step (once I have those I can render at about 250fps on my system, but I need to get the actual data).

```cpp
bool SimpleBlockRenderer::notSolid(WorldChunk &chunk, int x, int y, int z)
{
    //Gets a block id from the chunk if possible (correct range),
    //else falls back to World::getBlockId, which looks up the chunk
    //via a hashmap
    BlockId id = chunk.getWorldBlockId(x, y, z);
    const Block *block = getBlock(id);
    return !block->opaqueCube;
}
void SimpleBlockRenderer::cache(WorldChunk &chunk, ChunkRenderer &chunkRenderer, const Block *block, BlockId bid, int x, int y, int z)
{
    //Caches each of the 6 faces of a standard cube.
    //At present this is the majority of the world.
    //Possibility to smooth out things like grass, which would obviously be
    //more demanding than this.

    //The cache functions here are just putting stuff into a vertex or index
    //buffer. A small mutex-protected section then hands the updated buffer to
    //the render thread, which makes or updates the ID3D11Buffer objects.

    //east west
    if (notSolid(chunk, x+1, y, z))
        cacheEast(chunk, chunkRenderer, block, bid, x, y, z);
    if (notSolid(chunk, x-1, y, z))
        cacheWest(chunk, chunkRenderer, block, bid, x, y, z);
    //top bottom
    if (notSolid(chunk, x, y+1, z))
        cacheTop(chunk, chunkRenderer, block, bid, x, y, z);
    if (notSolid(chunk, x, y-1, z))
        cacheBottom(chunk, chunkRenderer, block, bid, x, y, z);
    //north south
    if (notSolid(chunk, x, y, z+1))
        cacheNorth(chunk, chunkRenderer, block, bid, x, y, z);
    if (notSolid(chunk, x, y, z-1))
        cacheSouth(chunk, chunkRenderer, block, bid, x, y, z);
}
```

Running on a 2.8GHz hex-core, so around 16% is an entire core.

```text
Stack, % of CPU time
|- SimpleBlockRenderer::cache                                   7.86
|     |- SimpleBlockRenderer::notSolid                          5.73
|     |     |- SimpleBlockRenderer::notSolid<itself>            3.86
|     |     |- WorldChunk::getBlockId                           1.28
|     |     |- World::getBlockId                                0.60
|     |     |- ntkrnlmp.exe!KiInterruptDispatchNoLock           0.00
|     |     |- ntkrnlmp.exe!KiDpcInterrupt                      0.00
|     |- SimpleBlockRenderer::cache<itself>                     1.02
|     |- SimpleBlockRenderer::cacheTop                          0.33
|     |- SimpleBlockRenderer::cacheEast                         0.17
|     |- SimpleBlockRenderer::cacheWest                         0.16
|     |- SimpleBlockRenderer::cacheNorth                        0.16
|     |- SimpleBlockRenderer::cacheBottom                       0.15
|     |- SimpleBlockRenderer::cacheSouth                        0.14
|     |- ntkrnlmp.exe!KiInterruptDispatchNoLock                 0.00
```



To run that over a 32x32x300 (approx height) region takes about 20ms with that code. With all the optimisations I could think of it's about 16ms, so still not going to work if I want to do multiple per frame, and I do not think duplicating the entire world state (perhaps even to the GPU) is practical; e.g. a 20x20 loaded region is 400 chunks, which is about 500MB of data (for ids and lightmap).

Now, doing multiple chunks at once has no dependencies between them, but the world data cannot be changed in the meantime, so given that, say, 3 chunks on average need to be recreated, 16ms vs 48ms is why I want to explore the best threading options.

A similar thing holds for general updates: if I make a rule that nothing may access or modify something directly more than 32 units away, then each update of a 32x32 region needs no locks, provided there is a 64-unit border between regions being updated.
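That border rule could be scheduled as in the following sketch (hypothetical names, and my own framing of the rule): with 32-unit-wide regions and updates that reach at most 32 units, a 3x3 colouring of the region grid keeps every pair of same-pass regions at least 64 units apart, so each of the nine passes can be fanned out to worker threads with no locks.

```cpp
#include <functional>

// Update every (rx, ry) region exactly once, in nine checkerboard-style
// passes. Regions within one pass are spaced 3 apart on each axis, i.e.
// separated by at least two 32-unit regions (a 64-unit border).
void updateRegionsInPasses(int regionsX, int regionsY,
                           const std::function<void(int, int)> &updateRegion)
{
    for (int px = 0; px < 3; ++px)
        for (int py = 0; py < 3; ++py)
        {
            // Everything inside one pass could be handed to a thread pool;
            // run serially here for clarity. A pool would need a "wait for
            // the pass to drain" barrier before the next pass starts.
            for (int rx = px; rx < regionsX; rx += 3)
                for (int ry = py; ry < regionsY; ry += 3)
                    updateRegion(rx, ry);
        }
}
```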

@HappyCoder

The above is one such bottleneck. As well as CPU sample profiles like the one above, I have the following, which was done quickly with QueryPerformanceCounter (to identify issues with some specific update steps that were over 200ms, which my interpolation rendering didn't really like). So apart from optimising the last few ms out of some things, or reducing the data set, threads seem a good idea.

```text
Logic update took too long. time=60.01636 target=50
//This needs rewriting anyway, since it was not updated to correctly
//handle chunk borders. However I suspect this task will always
//be fairly expensive
Lighting:          11.92671
//Checks which currently loaded chunks are needed, gets rid of unneeded ones
//and loads/creates new ones
//also checks if any background loaders have completed
Chunk Loading:     6.72902
//Loads or creates new chunks. If the chunk already exists it will be loaded
//with a background thread (decompressing the files takes a fair bit of
//CPU).
//Generation of new chunks is in-thread, limited to 1 per update step
  Ensure Loaded:   4.63302
//Deletes unneeded chunks from memory
  Unload:          0.03959
  Saving:          3.19523
//This could be improved, since it does some disk access on the logic thread
    World          0.03459
//Same
    Player         0.02342
//The logic thread just creates a std::vector<uint8_t> for these, and gives
//the vector to another thread that compresses it with zlib and writes it
    Chunks         2.93431

//Updates all entities, scheduled block updates and random block updates
//for all active chunks
Chunk Update:      6.42098
//Creates NPCs, plants trees, etc. Didn't run on this frame
  Populate:        0
//Logic simply to provide the renderer with data
//e.g. new vertex buffer contents
//Entities are not included here since I use a triple-buffer lockless
//scheme at the end of the entities' own updates (so included above)
//At the cost of artifacts, restricted to 2 chunks per update step
Render Update:     31.74442
  Chunk Cache:     31.45341
```


Edited by SyncViews, 21 August 2013 - 02:18 PM.

### #6AllEightUp  Moderators   -  Reputation: 4241


Posted 21 August 2013 - 03:22 PM

Given those numbers I'd say you won't gain much by threading.  You need to figure out why the cache and solid checks are taking up so much time before bothering with multiple threads.  I'd look at two specific things: memory access/data locality and possibly buffer locking issues.  Starting with the buffer locking, you may want to double buffer the vertex buffers to prevent any blocking there.  Additionally, if you don't already, lock the vertex buffer only once when starting the update and unlock when completely updated as each lock is effectively a mutex which could be expensive.

As to the memory portion, there are several options. You might want to look at a swizzle of the memory so your blocks are more localized. Another option could be to process in memory order instead of jumping around so much, i.e. do x-1, x, x+1 on a single y/z cross section, so if x is the primary memory order you only take the initial cache miss on x-1 and the other items will likely be pre-cached automatically. Heck, you might even want to look into issuing prefetch instructions yourself in order to prepare each of the lines. This is probably the more complicated and difficult item to correct and there are a lot of possible issues to look at; these are just two common ones.

These are just initial ideas given the information you present.  There are other ways to look at this such as issuing a single lighting call over all blocks at once instead of block by block (again potentially making it such that you can iterate memory in a single linear stream) etc.

### #7SyncViews  Members   -  Reputation: 465


Posted 21 August 2013 - 03:36 PM

Well, like I said, no locks in that code (at least not in the measured code as-is, just the small buffer hand-off that I didn't time). The vertex buffers are accessed in those cache methods anyway, which could perhaps be faster.

Well, the remote accesses in that are:

```text
//I guess in <itself> the getBlock(id), which is a lookup in an array of pointers.
//I did try a dense "solidBlocks[BLOCK_TYPE_COUNT]" boolean array, but did not see a noticeable improvement
|     |     |- SimpleBlockRenderer::notSolid<itself>            3.86
//Gets the block id from a dense array
|     |     |- WorldChunk::getBlockId                           1.28
//Uses an unordered_map to get the WorldChunk, then calls the method above. Used for the edges of chunks
|     |     |- World::getBlockId                                0.60
//No idea what this is, doesn't have a significant number of samples
|     |     |- ntkrnlmp.exe!KiInterruptDispatchNoLock           0.00

//As for cache itself, well it's just those function calls; how can I possibly reduce that 1%? Must be all function call overhead?
```

I think the issue is the fact that each chunk has say 32x32x300 blocks (307,200) and does 6 look-ups each (1,843,200). Is that ever going to be possible in the <5ms range?

I could look at the access order, I guess. But wouldn't accesses on 1 or 2 of the axes always break the cache, whichever order?

I guess I could group the blocks by render type, but surely staging would cost more than the few ms saved?

Edited by SyncViews, 21 August 2013 - 03:37 PM.

### #8AllEightUp  Moderators   -  Reputation: 4241


Posted 21 August 2013 - 03:38 PM

> Rule 1: Only use threads on tasks that really can run concurrently.
>
> Game logic is not likely somewhere you want to use threads. The problem with game logic is that it often depends on checking the state of other game objects, and obviously if those are getting changed in an unknown order you're going to have undefined behavior. Sound, networking, I/O: those are things better suited.

I just wanted to reply to this little bit.  Game logic is perfectly valid for multicore execution, you just need to do it correctly.  I use the single access rule in my game objects so I can multicore the hell out of them and it works like a charm with very low overhead.  Just for example's sake, say you have the following:

```cpp
void Update( float deltaT )
{
    // Say I'm following something.
    Vector3f target = mFollowTarget.mPosition;

    // Do movement code.
    ......

    // Apply movement.
    mPosition += deltaT * mVelocity;
}
```


This will obviously not thread correctly since it breaks the single access rule.  How?  Well, it breaks the rule by reading from mPosition (even though it is in a different object) and then modifying mPosition in this object.  Obviously you either have to wrap the access with a mutex or figure out how to do the code such that it doesn't break the rule, in my threading system I use update stages such as follows:

```cpp
void IssueUpdate()
{
    // Yup, a singleton, there is only one of these in a process at a time, end of subject, do not
    // pass go, do not make more than one... :)
    Threading::Team::Instance()(
        [=]( float deltaT )
        {
            // Say I'm following something.
            Vector3f target = mFollowTarget.mPosition;

            // Do movement code.
            ......
        } );
    Threading::Team::Instance().Barrier();  // Make sure all the above calls are processed before moving on.
    Threading::Team::Instance()(
        [=]( float deltaT )
        {
            mPosition = mPosition + mVelocity * deltaT;
        } );
}

// Somewhere else.
Threading::Team::Instance().Execute();
```


Obviously this is not exactly how you would want to do it, you really want "all" objects to issue the first part, then insert a single barrier and then all objects run the second part.  But, in general you can distribute this over all cores and get nearly linear performance gains.  You end up with 1 mutex per thread between the two functions (the barrier) but each object is updated without any locking requirements so you "only" have that single synchronization for the update.

Sorry, don't mean to pick on you, but saying game logic is a bad choice for threading is painting with an awfully large brush.

### #9AllEightUp  Moderators   -  Reputation: 4241


Posted 21 August 2013 - 04:03 PM

> Well, like I said, no locks in that code (at least not in the measured code as-is, just the small buffer hand-off that I didn't time). The vertex buffers are accessed in those cache methods anyway, which could perhaps be faster.
>
> Well, the remote accesses in that are:
>
> //I guess in <itself> the getBlock(id), which is a lookup in an array of pointers.
> //I did try a dense "solidBlocks[BLOCK_TYPE_COUNT]" boolean array, but did not see a noticeable improvement
> |     |     |- SimpleBlockRenderer::notSolid<itself>            3.86
> //Gets the block id from a dense array
> |     |     |- WorldChunk::getBlockId                           1.28
> //Uses an unordered_map to get the WorldChunk, then calls the method above. Used for the edges of chunks
> |     |     |- World::getBlockId                                0.60
> //No idea what this is, doesn't have a significant number of samples
> |     |     |- ntkrnlmp.exe!KiInterruptDispatchNoLock           0.00
>
> //As for cache itself, well it's just those function calls; how can I possibly reduce that 1%? Must be all function call overhead?
>
> I think the issue is the fact that each chunk has say 32x32x300 blocks (307,200) and does 6 look-ups each (1,843,200). Is that ever going to be possible in the <5ms range?
>
> I could look at the access order, I guess. But wouldn't accesses on 1 or 2 of the axes always break the cache, whichever order?
>
> I guess I could group the blocks by render type, but surely staging would cost more than the few ms saved?

You should very easily be able to hit your goal, but it requires a bit of a rethink of how you process things. (And assuming this is a memory issue; hard to be absolutely sure, of course.) Let's say you want to reduce the randomness of access: the first thing is to validate the order of your memory. So, assuming &data[ x=1 ] - &data[ x=0 ] is equal to your block storage size, then we know everything is in primary x-axis order in memory. Now, assuming &data[ y=1 ] - &data[ y=0 ] is equal to 32 units of block data, y is next in memory, and of course that means an index is x + y*xstride + z*xstride*ystride, such that each z step is furthest away in memory.

So, you know that moving in "z" is the most expensive item, which makes z-1/z+1 your most expensive calls in terms of memory. Each block is interested in three z indices, so this suggests a fairly elegant solution is possible to minimize random access. Make 3 bit arrays of 32x32 each, then fill those in with the IsSolid results for the z=0, z=1, z=2 slices of memory. Once you have 3 layers of precached data, you can quickly look up the results in the bit arrays and no longer thrash about in memory so much. To move to the next layer, you simply overwrite one of the 3 layers of bit data (i.e. triple buffering) with the next layer of data and keep going until you hit the lowest layer.

This is not the "best" solution I can think of, but it was an easy one to describe and probably fairly simple to implement as a first test.  Accessing the individual blocks is now done linearly and only once per block update instead of 3+1 times and you've cached the results of the test in a much smaller piece of memory which itself likely lives in cache the entire time.

It is a different way to look at the memory access and should provide significant speedup.
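A sketch of that triple-buffered slice cache (all names hypothetical; it assumes 32x32 slices and a caller that handles the z boundaries itself):

```cpp
#include <bitset>
#include <functional>

// Keep three 32x32 bitmaps of IsSolid results for layers z-1, z, z+1 and
// rotate them as the sweep moves along z, so each block is read linearly
// and only once instead of 3+ times.
class SolidSliceCache
{
public:
    explicit SolidSliceCache(std::function<bool(int, int, int)> isSolid)
        : mIsSolid(std::move(isSolid))
    {
        fill(0, 0); fill(1, 1); fill(2, 2);  // prime layers z = 0, 1, 2
    }

    // Query within the cached slices: dz in {-1, 0, +1} relative to z.
    bool solid(int x, int y, int z, int dz) const
    {
        return mSlice[(z + dz + 3) % 3][y * 32 + x];
    }

    // Finished layer z, moving to z+1: overwrite the oldest slice (z-1)
    // with data for layer z+2.
    void advance(int z) { fill((z + 2) % 3, z + 2); }

private:
    void fill(int slot, int z)
    {
        for (int y = 0; y < 32; ++y)
            for (int x = 0; x < 32; ++x)
                mSlice[slot][y * 32 + x] = mIsSolid(x, y, z);
    }

    std::function<bool(int, int, int)> mIsSolid;
    std::bitset<32 * 32> mSlice[3];
};
```

The three bitmaps total 384 bytes, so they stay resident in L1 cache for the whole sweep.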

Edited by AllEightUp, 21 August 2013 - 04:06 PM.

### #10phantom  Moderators   -  Reputation: 7433


Posted 21 August 2013 - 04:03 PM

> I just wanted to reply to this little bit.  Game logic is perfectly valid for multicore execution, you just need to do it correctly.  I use the single access rule in my game objects so I can multicore the hell out of them and it works like a charm with very low overhead.

Indeed.

The other option is to maintain 'shadow state', whereby objects have a 'public' copy of data (likely a subset of internal state); anyone is free to read from it, and the objects only update their internal state. At a predefined point in time all objects copy 'live' to 'shadow'.

Yes, all your reads are one frame behind but for most things this is unlikely to make a huge difference and removes any and all locks from the system. This is probably best expressed in 'tasks'.
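A minimal sketch of that shadow-state idea (hypothetical names, single object shown):

```cpp
// Each object mutates only its private 'live' state during the update;
// everyone else reads last frame's published 'shadow' copy. The publish
// step runs once per frame, at a single well-defined point.
struct EntityState { float x = 0, y = 0; };

class Entity
{
public:
    void update(float dt) { mLive.x += dt; }              // writes private state only
    const EntityState &shadow() const { return mShadow; } // safe for anyone to read
    void publish() { mShadow = mLive; }                   // the copy-live-to-shadow point
private:
    EntityState mLive, mShadow;
};
```

Because nothing reads `mLive` and nothing writes `mShadow` during the update phase, no locks are needed; readers simply see values that are one frame old.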

Rendering is also a good candidate for threading as there are many stages which can be parallelised before final command submission.

For example our renderer is currently configured to have 8 stages before the final output is generated; the command list for each of these stages can be generated in parallel before final submission to the GPU.

In our case the render pipe before command list generation is also multi-stage, with each stage threaded internally via a task system as well; all in all we have 6 threads which pick up work so each stage can go wide [stage 1]-[stage 2]-[stage 3]-[stage 4]-[scene command list generation], and it is only the final GPU submission which is single-threaded.

### #11AllEightUp  Moderators   -  Reputation: 4241


Posted 21 August 2013 - 04:37 PM

> Indeed.
>
> The other option is to maintain 'shadow state', whereby objects have a 'public' copy of data (likely a subset of internal state); anyone is free to read from it, and the objects only update their internal state. At a predefined point in time all objects copy 'live' to 'shadow'.
>
> Yes, all your reads are one frame behind, but for most things this is unlikely to make a huge difference and it removes any and all locks from the system. This is probably best expressed in 'tasks'.

Yeah, I actually use shadow states myself but I still use the barrier solution for "system" update separation, I just figured the outline was a nice simple example of the different method of looking at things.  Anyway, the better example would be move all the objects in one shot using shadows, barrier, apply all shadows, barrier, update sweep/prune awareness system, culling, and anything else not interdependent, barrier, issue rendering etc etc.  You still need the concept of the barrier to separate those updates when appropriate but it is still a massive reduction in locking/blocking.

As to the "task" system, I have a different way to look at it.  Some items distribute differently than others, so I supply a generic concept of "distributor" which will simply be called by all threads in the team.  So, an object update distributor simply divides up the updates to be called among all threads and lets them go to town, it doesn't care if they come back at different rates and there is no blocking, just one atomic to track which sections of the array have been issued to each thread.  Threads come out and start executing the next distributor which in the outline would be a barrier since we want to make sure all objects are updated prior to applying the shadows.  (Assuming the tasks are side by side as in the example.)  On the other hand, I have a sweep prune awareness system which is currently only multicored enough to use 3 threads right now, so that uses a custom distributor which grabs the first three threads which enter and lets all the remaining ones pass through, those threads could go onto say update the scene graph/culling since it is not reliant on the sweep prune nor affected by it so they end up running in parallel.
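The object-update distributor described here, with "just one atomic to track which sections of the array have been issued", might look like this minimal sketch (hypothetical names):

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>

// Every team thread repeatedly calls next() to claim the next chunk of the
// object array. A single atomic cursor, no locks; threads that finish early
// simply stop receiving ranges and move on to the next distributor.
class RangeDistributor
{
public:
    RangeDistributor(std::size_t count, std::size_t grain)
        : mCount(count), mGrain(grain) {}

    // Claim the next [begin, end) range; returns false once the array is done.
    bool next(std::size_t &begin, std::size_t &end)
    {
        begin = mNext.fetch_add(mGrain, std::memory_order_relaxed);
        if (begin >= mCount)
            return false;
        end = std::min(begin + mGrain, mCount);
        return true;
    }

private:
    const std::size_t mCount, mGrain;
    std::atomic<std::size_t> mNext{0};
};
```

The grain size trades contention on the atomic against load balance; a few dozen objects per grab is a common starting point.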

As to the reads being a frame behind, I prefer it that way since it is a consistent guarantee, and at worst, if you absolutely have to be looking at "now", you can just extrapolate approximately one frame ahead, which is usually fairly accurate at such a fine-grained delta. Order issues in object dependencies are always a nightmare in traditional game loops: this frame one object updated first but depended on another; next frame they may be reversed, and you end up with nasty little oscillations in things like a follow state.

### #12SyncViews  Members   -  Reputation: 465


Posted 21 August 2013 - 04:40 PM

Well, the render thread (using the lockless access of entity data with a triple-buffering scheme, and those prepared static vertex buffers) manages 400+ FPS if I don't cap it. Profiling shows 6% in WinMain (the rendering thread), plus some 10% in "atidxx64.dll"; no idea what that is, or what thread it even is, it's just another item under "kernel32.dll!BaseThreadInitThunk" and they seem to add up to the 16% CPU available to the thread.

I suppose I could optimise that, e.g. simply cap the frame rate, but there are still another 4 cores to use on this box before reducing the CPU usage of some thread seems to be a concern. I certainly don't see the need for multi-core rendering; at least in the sense I normally see advertised, it doesn't help me create the buffers any better. So I really want to look at the logic side, since it is constantly missing its 20Hz goal whenever things happen that need lights and static buffers rebuilt.

Is there some kind of counter that says how often the CPU stalled because it needed data from main memory? I think I read Intel have some sort of internal performance counters; not sure about AMD...

I'll give the solid-buffers thing a go in the morning, since that seems to be the biggest cost, given the vast majority of faces are not visible. I guess I need the buffers to be side length + 2 long to handle the edges, but that should be easy enough to fill in a loop. I'll check which order I put the buffer in as well (I have a feeling it's x * a + y * b + z, or possibly y first for some reason).

So basically, if it is a memory cache/bandwidth issue, running that block of code (well, the code that does the 32x32xN chunk) on 4 threads at once won't give a 4x speedup, even if there are no locks for any threads, just a "wait for everything to be done" before continuing on after (e.g. like below)?


```cpp
for (auto it = chunks.begin(); it != chunks.end(); ++it)
{
    auto chunk = *it;
    if (chunk->hasRendererBlockChanged() && chunk->shouldRender())
    {
        // chunk->getRenderer().buildCache();
        // run with an existing thread in the pool
        // (idle threads will just wait on a WaitForSingleObject or similar)
        // read-only on the world/chunk data, just writes to chunk->getRenderer(),
        // which is a per-chunk object
        threadPool.run(
            std::bind(
                &ChunkRenderer::buildCache,
                &chunk->getRenderer()));
    }
}
//wait for all queued tasks to complete on this pool
//perhaps let it use this thread as well to help, although I guess it does not
//matter whether it helps or blocks the thread, if the pool has the right
//number of threads internally (just the run calls above)
threadPool.wait();
```


Edited by SyncViews, 21 August 2013 - 04:41 PM.

### #13AllEightUp  Moderators   -  Reputation: 4241


Posted 21 August 2013 - 05:06 PM

> Well, the render thread (using the lockless access of entity data with a triple-buffering scheme, and those prepared static vertex buffers) manages 400+ FPS if I don't cap it. Profiling shows 6% in WinMain (the rendering thread), plus some 10% in "atidxx64.dll"; no idea what that is, or what thread it even is, it's just another item under "kernel32.dll!BaseThreadInitThunk" and they seem to add up to the 16% CPU available to the thread.

The first item, 'atidxx64.dll', is the actual driver for your video card. It is possible that it is completely valid, but 10% seems a bit high, so I'd go back and double-check that your vertex buffer accesses are double-buffered and all that. Using PIX (or whatever) is likely the only way to figure out the details of your utilization though. The kernel calls are generally bad items also; as an experiment (and just to get a rough idea of how bad), you can bring up Task Manager, go to the Performance tab, and select View/Show Kernel Times. Run your project, and if the red line is anywhere near the green line, your problems are "likely" in the graphics API usage and/or other threading issues such as massive blocking on a mutex somewhere. (This is just a very rough way to get some extra data, no promises how accurate it is.)

> I suppose I could optimise that, e.g. simply cap the frame rate, but there are still another 4 cores to use on this box before reducing the CPU usage of some thread seems to be a concern. I certainly don't see the need for multi-core rendering; at least in the sense I normally see advertised, it doesn't help me create the buffers any better. So I really want to look at the logic side, since it is constantly missing its 20Hz goal whenever things happen that need lights and static buffers rebuilt.
>
> Is there some kind of counter that says how often the CPU stalled because it needed data from main memory? I think I read Intel have some sort of internal performance counters; not sure about AMD...

There are quite a few counters available, unfortunately accessing them is a pain in the ass or requires something like VTune. There used to be a library which worked on Windows called Papi, but they removed the windows support some time back I believe.

I'll give the solid buffers thing a go in the morning, since that seems to be the biggest cost given the vast majority of faces are not visible. I guess I need the buffers to be side length + 2 long to handle the edges, but that should be easy enough to fill in a loop. Will check which order I put the buffer in as well (I have a feeling it's x * a + y * b + z, or possibly y first for some reason).

So basically if it is a memory cache/bandwidth issue running that block of code (well the code that does the 32x32xN chunk) on 4 threads at once won't give a 4x speedup, even if there are no locks for any threads, just a "wait for everything to be done" before continuing on after (e.g. like below)?

Well, as with anything, you could get some benefit by multi-threading it but unfortunately that is unlikely to help until you know exactly where the slowdown is. I'd run the rough test looking at kernel time first before further exploring this, if that kernel time is notable then my suggestions would be to go back to looking at your API usages.

Edited by AllEightUp, 21 August 2013 - 05:07 PM.


### #14SyncViews  Members   -  Reputation: 465


Posted 21 August 2013 - 05:33 PM

Well, I guessed it was the display driver, but I still don't see why it shows up in the xperf stack trace there; why not inside an ID3D11DeviceContext->XYZ call? It's as if it has its own thread, blocks mine, then runs its own for a bit? I guess at 400FPS it could just be back-buffer management stuff in practice; like I mentioned in some other thread, I doubt MS or AMD/Nvidia/Intel optimised all the things that should be once per frame anyway. If I cap it to 60FPS, both bits of code go down to around 3% in xperf.

What do you mean by double-buffered vertex buffers though? I just have some ID3D11Buffer objects which I put data into as part of the CreateBuffer call, or occasionally Map/Unmap (mainly constant buffers containing a matrix for some object). I never read from the ID3D11Buffer, although I guess it is possible if I specified the CPU read flag.

As for the kernel calls, I will take a look. A 50 second xperf run only had a few samples in it; not sure exactly how those work and how xperf would interact though. Actually, on inspection it almost looks like ETW/xperf detecting itself:

|     |     |     |     |     |     |- ntkrnlmp.exe!KiInterruptDispatchNoLock (2 samples in 50 seconds)
|     |     |     |     |     |     |- ntkrnlmp.exe!KiDpcInterrupt (1 sample in 50 seconds)
|     |     |     |     |     |     |     ntkrnlmp.exe!KxDispatchInterrupt
|     |     |     |     |     |     |     ntkrnlmp.exe!SwapContext_PatchXRstor
|     |     |     |     |     |     |     ntkrnlmp.exe!EtwTraceContextSwap



The task manager kernel line seems to fluctuate between 1/8 and 1/6 of the load (approx 30% total, approx 3.75% to 5% kernel), although it's hard to know how much of that is my doing. Without my app running, all the other stuff looks like 1 to 5%.

Was just using "xperf -on latency -stackwalk profile" for the profiles, I recall there are a bunch of other things but can't really remember them, just been using that from a .bat I made like a year ago. Might go look through the list again.

Edited by SyncViews, 21 August 2013 - 05:36 PM.


### #15Satharis  Members   -  Reputation: 1060


Posted 21 August 2013 - 11:07 PM

I just wanted to reply to this little bit.  Game logic is perfectly valid for multicore execution, you just need to do it correctly.  I use the single access rule in my game objects so I can multicore the hell out of them and it works like a charm with very low overhead.  Just for example's sake, say you have the following:

-snip-

Oh I don't mind, it's true game logic can be multithreaded depending on the case but like you said it is an awfully large brush for the thing in general. Game logic can cover a lot of things from world interactions to AI and what can be threaded surely depends a lot on the game.

To me it's something you would only really want to consider when you exhaust other options, because it has so many points of possible clash, unlike something like IO or sound, which are more easily separated out and tend to burn CPU time doing moderate work.

But you didn't suggest jumping right to multithreading, which is good, I definitely think exhausting other options is helpful before throwing threads at something like a giant whip like people tend to do.

Good ideas though.

### #16AllEightUp  Moderators   -  Reputation: 4241


Posted 21 August 2013 - 11:17 PM

Well, I guessed it was the display driver, but I still don't see why it shows up in the xperf stack trace there; why not inside an ID3D11DeviceContext->XYZ call? It's as if it has its own thread, blocks mine, then runs its own for a bit?

Actually, that is exactly the case, normally. The thread is for the driver and the reason it doesn't show up is usually because of the kernel transition in front of whatever call is causing the block. As a kernel thread it is a lighter weight version of a user thread and doesn't follow all the same rules. This of course tends to break callstacks since there is poor translation between the two forms of threads. (NOTE: Been nearly 8 years since I've messed with kernel stuff, perhaps things have changed a bit but don't think so.)

I guess at 400FPS though it could just be back buffer management stuff in practice, like I mentioned in some other thread I doubt MS or AMD/Nvidia/Intel optimised all the things that should be once per frame anyway. If I cap it to 60FPS, both bits of code go down to like 3% in xperf.

In reality 400fps is not very good, if you are not blocking somewhere then you have a CPU bottleneck. It is plenty fast for the single chunk but if you want 6 visible chunks you've dropped under 60fps. Using simple fullscreen clear, you can hit 4/5k+ fps and something this simple should be hitting at least 1k. (Basically about 3 times as fast as you are seeing.)

What do you mean by double buffered vertex buffers though? I just have some ID3D11Buffer objects which I put data into as part of the CreateBuffer call, or ocassionally Map/Unmap (mainly constant buffers containing a matrix for some object). I never read from the ID3D11Buffer, although I guess it is possible if I specified with cpu read flag.

It really depends on usage. Even when marked as read only, bad things can happen at the api/driver level if you don't do things in a specific manner which is sometimes driver specific. I avoid the driver specifics myself by double buffering the VB's (assuming I'm going to change everything). Basically say you map a buffer, write 6 vertices, unmap, issue a draw triangle for those 6, map again, write some more, issue some more. Unless you properly say "I'm only changing 6 vertices starting here", the driver really doesn't have any manner of knowing that you only care about those six. It may be smart enough to figure out that the draw call only uses the six and only do a single DMA for that small portion but some (many?) are not that smart and will push the entire buffer over the interconnect for each call to render a cube. Massive wastage and you run out of driver command queue slots very quickly since each draw call would have association with a DMA requirement to update the VB prior to the command execution.

Individual sectional modifications can be better for some things, your cube rendering may be of this case depending on how you have setup the DMA streams and perform your rendering. On the other hand, if you are going to modify the entire thing, the double buffer is usually best in order to avoid "most" driver differences, assuming you can afford the memory.

Double buffering the VBs just means two VBs: you write to the one not in use while the other is rendering. Sorry to post all the details before this simple definition, but those details are important in determining whether this is best for your usage.

As for the kernel calls, will take a look. A 50 second xperf results only had a few samples in them, not sure exactly how those work and how xperf would interact though. Actually on inspection almost looks like ETW/xperf detecting itself

|     |     |     |     |     |     |- ntkrnlmp.exe!KiInterruptDispatchNoLock (2 samples in 50 seconds)
|     |     |     |     |     |     |- ntkrnlmp.exe!KiDpcInterrupt (1 sample in 50 seconds)
|     |     |     |     |     |     |     ntkrnlmp.exe!KxDispatchInterrupt
|     |     |     |     |     |     |     ntkrnlmp.exe!SwapContext_PatchXRstor
|     |     |     |     |     |     |     ntkrnlmp.exe!EtwTraceContextSwap


The task manager kernel line seems to fluctuate between 1/8 and 1/6 of the load (approx 30% total, approx 3.75% to 5% kernel), although it's hard to know how much of that is my doing. Without my app running, all the other stuff looks like 1 to 5%.

I forgot one suggestion, turn on "one graph per CPU" also. Due to affinity, your application will likely run on a single core and nothing else will be dispatched to that core while your game is running. (For the most part.) So, the normal 1-4% idle kernel work will by nature be done on the other cores and the core with your game on it will be a pretty clean view. 1/8th-1/6th is bad, it really should be in the 2-3% at most range if you are not getting blocked by drivers and such.

Was just using "xperf -on latency -stackwalk profile" for the profiles, I recall there are a bunch of other things but can't really remember them, just been using that from a .bat I made like a year ago. Might go look through the list again.

I've never used XPerf unfortunately, I always have companies buy me VTune because while not the easiest in the world to get working properly it is generally the best once it does work.

### #17SyncViews  Members   -  Reputation: 465


Posted 22 August 2013 - 03:07 AM

400FPS with around 370 chunks (it renders a circle of chunks up to 384 units away) with a broken/disabled frustum cull (a separate issue I need to think about, since my code only actually tested the AABB corner points, and the AABB for a chunk is large enough that this gives false negatives). But yes, I think that hits the CPU limit there rather than the GPU limit (AMD Radeon HD 7870), but I suspect once I have some graphics that include more than solid-pixel rendering with the simplest ever shader and a few 16x16 textures, that gap will close.

For VBs, isn't that what the no-cpu/discard/no-overwrite/etc. vertex buffer access flags are for? Or even creating a new vertex buffer from scratch (still playing with the best way to handle that, since obviously I can't just make one with enough space for every vertex that might ever exist in the chunk). I'm not even sure how I am meant to know the GPU is truly finished, since Present doesn't mean the GPU is finished? Practically, I guess worst case is I might change a given vertex buffer once every few frames, and others nearly never. It's the constant buffers that seem to get changed all the time, to render models and such with different transforms.

I did consider VTune, but that would also require an Intel CPU, which I have been considering since before Ivy Bridge came out. I've been hoping their next gen might actually boost CPU performance to a point where I care for the cost, but each time around I seemed to get offered the same cost + inflation, with a faster integrated GPU...

### #18phantom  Moderators   -  Reputation: 7433


Posted 22 August 2013 - 03:50 AM

Or, ya know, you could break out some tools to properly profile things instead of doing all this guess work?

Simply looking at the attached CPU trace shows the CPUs are largely idle during that run which means you still have resources to spare.

AMD have tools to profile both the CPU and GPU which you can get for free; use them to find out WHAT is going on instead of vaguely waving your hands around and going 'well, its probably this...' - that isn't programming, that's voodoo.

### #19AllEightUp  Moderators   -  Reputation: 4241


Posted 22 August 2013 - 04:36 AM

Or, ya know, you could break out some tools to properly profile things instead of doing all this guess work?

No offense intended, but I believe given the information in the thread there is pretty decent profiling being performed. I'd ignore the thread if I thought it was all ad hoc "something be bad". The current replies suggest a pretty decent amount of profile and test, which to me suggests an interesting problem.

### #20AllEightUp  Moderators   -  Reputation: 4241


Posted 22 August 2013 - 05:21 AM

400FPS with around 370 chunks (renders a circle of chunks up to 384 units away) with a broken/disabled frustum cull.....

Ah, BIG difference in my thinking then. I thought your numbers were for a single 32x32x300 chunk; I guess I missed that it was in a multiple chunk context. I need to reconsider some of my initial concepts to a degree. Though it actually suggests that the API is the blocking item and not your code, which suggests more likelihood of bad API usage. Though, in such a case it is likely a "minor" item which simply adds up.

Sorry if I'm annoying you by picking at things but it is usually the "picky stuff" that is the problem. If you are rendering that many chunks then the "little" things add up. A single fence implied by modifying a vertex buffer could be the cause of the problems. You seem to be using vertex buffers well though so I won't repeat said issues, but there are plenty of other easy to ignore but highly performance critical variations to data access "possible" here, I'm just poking at the most common ones given current information.

For VBs, isn't that what the no-cpu/discard/no-overwrite/etc. vertex buffer access flags are for?

Yip, very much so. But... The drivers don't always work the same way when you modify the same buffer 2 or 3 times per frame. It *SHOULD* for all intents and purposes behave the same no matter the driver but..... They don't, sorry. A certain driver uses a "best case" optimization assuming that the buffer will change once per frame at most. As such any full buffer locks instead of sectional locks will quickly start blocking because the driver can no longer rely on that implied "user" behavior.

This is the unfortunate problem of a "thin" layer over the drivers. They are all supposed to behave the same way but they don't. Outside of the API specifics, things are kinda random when performance items are involved.

Or even creating a new vertex buffer from scratch (still playing with the best way to handle that, since obviously I cant just make one with enough space for every vertex that might ever exist in the chunk). Not even sure how I am meant to know the GPU is truly finished, since Present doesn't mean the GPU is finish? Practically I guess worst case is I might change a given vertex buffer once every few frames, and others nearly never. Its the constant buffers that seem to get changed all the time to render models and such with different transforms.

Yeah, sucks don't it. Sorry, I have absolutely no personal feeling on this, been there, done that, and figure you have or will learn all the nightmares eventually. I can say don't do x, y or z all day and you will likely still have parts doing those things. My suggestions and experience just won't really make sense till you actually run into some of the annoying issues. All I can really do is explain why you stepped on yer member and how to fix it, wish I could really say "don't do X" but there is no clear way to describe X to start with.

