Multithreading a game's logic



Indeed.

The other option is to maintain 'shadow state', whereby objects have a 'public' copy of data (likely a subset of internal state); anyone is free to read from it, and the objects only update their internal state. At a predefined point in time all objects copy 'live' to 'shadow'.

Yes, all your reads are one frame behind but for most things this is unlikely to make a huge difference and removes any and all locks from the system. This is probably best expressed in 'tasks'.
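Rough sketch of the idea (names invented for the example, not any particular engine's API):

    // Live/shadow split: update() touches only the private live state (and may
    // read other objects' shadows); publishShadow() is called for every object
    // at the predefined sync point; readers only ever look at the shadow.
    struct EntityState
    {
        float x = 0, y = 0, z = 0;       // position
        float vx = 0, vy = 0, vz = 0;    // velocity
    };

    class Entity
    {
    public:
        void update(float dt)            // owning thread only
        {
            live.x += live.vx * dt;
            live.y += live.vy * dt;
            live.z += live.vz * dt;
        }

        void publishShadow() { shadow = live; }                   // at the sync point
        const EntityState& getShadow() const { return shadow; }   // safe to read any time

    private:
        EntityState live;
        EntityState shadow;
    };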

Yeah, I actually use shadow states myself, but I still use the barrier solution for "system" update separation; I just figured the outline was a nice simple example of a different way of looking at things. Anyway, the better example would be: move all the objects in one shot using shadows, barrier, apply all shadows, barrier, update the sweep/prune awareness system, culling, and anything else not interdependent, barrier, issue rendering, etc. You still need the concept of the barrier to separate those updates where appropriate, but it is still a massive reduction in locking/blocking.

As to the "task" system, I have a different way to look at it. Some items distribute differently than others, so I supply a generic concept of "distributor" which will simply be called by all threads in the team. So, an object update distributor simply divides up the updates to be called among all threads and lets them go to town, it doesn't care if they come back at different rates and there is no blocking, just one atomic to track which sections of the array have been issued to each thread. Threads come out and start executing the next distributor which in the outline would be a barrier since we want to make sure all objects are updated prior to applying the shadows. (Assuming the tasks are side by side as in the example.) On the other hand, I have a sweep prune awareness system which is currently only multicored enough to use 3 threads right now, so that uses a custom distributor which grabs the first three threads which enter and lets all the remaining ones pass through, those threads could go onto say update the scene graph/culling since it is not reliant on the sweep prune nor affected by it so they end up running in parallel.

As to the reads being a frame behind, I prefer it that way since it is a consistent guarantee which can be made, and at worst, if you absolutely have to be looking at "now", you can just extrapolate roughly one frame ahead, which is usually fairly accurate at such a fine-grained delta. Order issues in object dependencies are always a nightmare in traditional game loops: that object updated first but depended on this object, next frame they may be reversed, and you end up with nasty little oscillations in things like a follow state.
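The extrapolation itself is nothing fancy, something like this sketch (assuming the shadow also stores the velocity it had when it was published):

    // Estimate "now" from a one-frame-old shadow.
    struct Shadow { float x, y, z, vx, vy, vz; };

    Shadow extrapolate(const Shadow& s, float frameDelta)
    {
        Shadow now = s;
        now.x += s.vx * frameDelta;
        now.y += s.vy * frameDelta;
        now.z += s.vz * frameDelta;
        return now;
    }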


Well, the render thread (using lockless access to entity data with a triple-buffering scheme, plus the prepared static vertex data) manages 400+ FPS if I don't cap it. Profiling shows 6% in WinMain (the rendering thread), plus some 10% in "atidxx64.dll"; no idea what that is, or what thread it even is, it's just another item under "kernel32.dll!BaseThreadInitThunk", and they seem to add up to the 16% CPU available to the thread.

I suppose I could optimise that, e.g. simply cap the frame rate, but there are still another 4 cores to use on this box before reducing the CPU usage of some thread seems to be a concern. I certainly don't see the need for multi-core rendering; at least in the sense I normally see advertised, it doesn't help me create the buffers any better. So I really want to look at the logic side, since it is constantly missing its 20Hz goal whenever things happen that need lights and static buffers rebuilt.

Is there some kind of counter that says how often the CPU stalled because it needed data from main memory? I think I read Intel has some sort of internal performance counters; not sure about AMD...

I'll give the solid buffers thing a go in the morning, since that seems to be the biggest cost given the vast majority of faces are not visible. I guess I need the buffers to be side length + 2 long to handle the edges, but that should be easy enough to fill in a loop. I'll check which order I put the buffer in as well (I have a feeling it's x * a + y * b + z, or possibly y first for some reason).

So basically, if it is a memory cache/bandwidth issue, running that block of code (well, the code that does the 32x32xN chunk) on 4 threads at once won't give a 4x speedup, even if there are no locks for any threads, just a "wait for everything to be done" before continuing on after (e.g. like below)?



    for (auto it = chunks.begin(); it != chunks.end(); ++it)
    {
        auto chunk = *it;
        if (chunk->hasRendererBlockChanged() && chunk->shouldRender())
        {
            // chunk->getRenderer().buildCache();
            // run with existing thread in the pool
            // (idle threads will just wait on a WaitForSingleObject or similar)
            // read only on the world/chunk data, just writes to chunk->getRenderer() which is a per
            // chunk object
            threadPool.run(
                std::bind(
                    &ChunkRenderer::buildCache,
                    &chunk->getRenderer()));
        }
    }
    // wait for all queued tasks on this pool to complete
    // (perhaps let it use this thread as well to help, although I guess it doesn't matter
    // whether it helps or just blocks, as long as the pool has the right number of threads
    // internally for the run calls above)
    threadPool.wait();

Well, the render thread (using lockless access to entity data with a triple-buffering scheme, plus the prepared static vertex data) manages 400+ FPS if I don't cap it. Profiling shows 6% in WinMain (the rendering thread), plus some 10% in "atidxx64.dll"; no idea what that is, or what thread it even is, it's just another item under "kernel32.dll!BaseThreadInitThunk", and they seem to add up to the 16% CPU available to the thread.


The first item, 'atidxx64.dll', is the actual driver for your video card. It is possible that it is completely valid, but 10% seems a bit high, so I'd go back and double-check that your vertex buffer accesses are double buffered and all that. Using PIX (or whatever) is likely the only way to figure out the details of your utilization though. The kernel calls are generally bad items also; as an experiment (and just to get a rough idea of how bad), you can bring up Task Manager, go to the Performance tab, and select View/Show Kernel Times. Run your project, and if the red line is anywhere near the green line, your problems are "likely" in the graphics API usage and/or other threading issues such as massive blocking on a mutex somewhere. (This is just a very rough way to get some extra data, no promises how accurate it is.)

I suppose I could optimise that, e.g. simply cap the frame rate, but there are still another 4 cores to use on this box before reducing the CPU usage of some thread seems to be a concern. I certainly don't see the need for multi-core rendering; at least in the sense I normally see advertised, it doesn't help me create the buffers any better. So I really want to look at the logic side, since it is constantly missing its 20Hz goal whenever things happen that need lights and static buffers rebuilt.

Is there some kind of counter that says how often the CPU stalled because it needed data from main memory? I think I read Intel has some sort of internal performance counters; not sure about AMD...


There are quite a few counters available; unfortunately accessing them is a pain in the ass or requires something like VTune. There used to be a library which worked on Windows called PAPI, but they removed the Windows support some time back, I believe.

I'll give the solid buffers thing a go in the morning, since that seems to be the biggest cost given the vast majority of faces are not visible. I guess I need the buffers to be side length + 2 long to handle the edges, but that should be easy enough to fill in a loop. I'll check which order I put the buffer in as well (I have a feeling it's x * a + y * b + z, or possibly y first for some reason).

So basically, if it is a memory cache/bandwidth issue, running that block of code (well, the code that does the 32x32xN chunk) on 4 threads at once won't give a 4x speedup, even if there are no locks for any threads, just a "wait for everything to be done" before continuing on after (e.g. like below)?


Well, as with anything, you could get some benefit by multithreading it, but unfortunately that is unlikely to help until you know exactly where the slowdown is. I'd run the rough test looking at kernel time first before exploring this further; if that kernel time is notable, then my suggestion would be to go back to looking at your API usage.

Well, I guessed it was the display driver, but I still don't see why it shows up in the xperf stack trace there; why not inside an ID3D11DeviceContext->XYZ call? It's as if it has its own thread, blocks mine, then runs for a bit? I guess at 400 FPS, though, it could just be back buffer management stuff in practice; like I mentioned in some other thread, I doubt MS or AMD/Nvidia/Intel optimised all the things that should be once per frame anyway. If I cap it to 60 FPS, both bits of code go down to like 3% in xperf.

What do you mean by double-buffered vertex buffers, though? I just have some ID3D11Buffer objects which I put data into as part of the CreateBuffer call, or occasionally Map/Unmap (mainly constant buffers containing a matrix for some object). I never read from the ID3D11Buffer, although I guess it is possible if I specified the CPU read flag.

As for the kernel calls, I will take a look. A 50-second xperf run only had a few samples in it; not sure exactly how those work and how xperf would interact, though. Actually, on inspection it almost looks like ETW/xperf detecting itself:


   |     |     |     |     |     |     |- ntkrnlmp.exe!KiInterruptDispatchNoLock (2 samples in 50 seconds)
   |     |     |     |     |     |     |- ntkrnlmp.exe!KiDpcInterrupt (1 sample in 50 seconds)
   |     |     |     |     |     |     |     ntkrnlmp.exe!KxDispatchInterrupt
   |     |     |     |     |     |     |     ntkrnlmp.exe!SwapContext_PatchXRstor
   |     |     |     |     |     |     |     ntkrnlmp.exe!EtwTraceContextSwap

The Task Manager kernel line seems to fluctuate between 1/8 and 1/6 of the load (approx 30% total, approx 3.75% to 5% kernel), although it's hard to know how much of that is my doing. Without my app running, all the other stuff looks like 1 to 5%.

I was just using "xperf -on latency -stackwalk profile" for the profiles; I recall there are a bunch of other options, but I can't really remember them, I've just been using that from a .bat I made like a year ago. Might go look through the list again.

I just wanted to reply to this little bit. Game logic is perfectly valid for multicore execution; you just need to do it correctly. I use the single-access rule in my game objects so I can multicore the hell out of them, and it works like a charm with very low overhead. Just for example's sake, say you have the following:

-snip-


Oh I don't mind, it's true game logic can be multithreaded depending on the case, but like you said it is an awfully large brush for the thing in general. Game logic can cover a lot of things, from world interactions to AI, and what can be threaded surely depends a lot on the game.

To me it's something you would only really want to consider when you've exhausted other options, because it has so many points of possible clash, unlike something like IO or sound, which are more easily separated out and tend to burn CPU time doing moderate work.

But you didn't suggest jumping right to multithreading, which is good; I definitely think exhausting other options is helpful before throwing threads at a problem like a giant whip, as people tend to do.

Good ideas though.

Well, I guessed it was the display driver, but I still don't see why it shows up in the xperf stack trace there; why not inside an ID3D11DeviceContext->XYZ call? It's as if it has its own thread, blocks mine, then runs for a bit?

Actually, that is exactly the case, normally. The thread is for the driver, and the reason it doesn't show up is usually because of the kernel transition in front of whatever call is causing the block. As a kernel thread it is a lighter-weight version of a user thread and doesn't follow all the same rules. This of course tends to break callstacks, since there is poor translation between the two forms of threads. (NOTE: It's been nearly 8 years since I've messed with kernel stuff; perhaps things have changed a bit, but I don't think so.)

I guess at 400 FPS, though, it could just be back buffer management stuff in practice; like I mentioned in some other thread, I doubt MS or AMD/Nvidia/Intel optimised all the things that should be once per frame anyway. If I cap it to 60 FPS, both bits of code go down to like 3% in xperf.

In reality 400 FPS is not very good; if you are not blocking somewhere, then you have a CPU bottleneck. It is plenty fast for a single chunk, but if you want 6 visible chunks you've dropped under 60 FPS. With a simple fullscreen clear you can hit 4-5k+ FPS, and something this simple should be hitting at least 1k. (Basically about 3 times as fast as you are seeing.)

What do you mean by double-buffered vertex buffers, though? I just have some ID3D11Buffer objects which I put data into as part of the CreateBuffer call, or occasionally Map/Unmap (mainly constant buffers containing a matrix for some object). I never read from the ID3D11Buffer, although I guess it is possible if I specified the CPU read flag.

It really depends on usage. Even when marked as read only, bad things can happen at the API/driver level if you don't do things in a specific manner, which is sometimes driver specific. I avoid the driver specifics myself by double buffering the VBs (assuming I'm going to change everything). Basically, say you map a buffer, write 6 vertices, unmap, issue a draw for those 6, map again, write some more, issue some more. Unless you properly say "I'm only changing 6 vertices starting here", the driver really doesn't have any way of knowing that you only care about those six. It may be smart enough to figure out that the draw call only uses the six and only do a single DMA for that small portion, but some (many?) are not that smart and will push the entire buffer over the interconnect for each call to render a cube. That's massive wastage, and you run out of driver command queue slots very quickly, since each draw call would be associated with a DMA requirement to update the VB prior to the command execution.

Individual sectional modifications can be better for some things; your cube rendering may be of this case, depending on how you have set up the DMA streams and perform your rendering. On the other hand, if you are going to modify the entire thing, the double buffer is usually best in order to avoid "most" driver differences, assuming you can afford the memory.

Double buffering the VBs just means two VBs: you write to the one not in use while the other is rendering. Sorry to post all the details before giving this simple definition, but those are important details to keep in mind when determining if this is best for your usage.
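As a rough sketch of what I mean (not drop-in code: the Vertex type and function name are placeholders, error handling omitted, both buffers assumed created with D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE):

    #include <d3d11.h>
    #include <cstring>

    struct Vertex { float x, y, z; };        // placeholder layout for the sketch

    // Two dynamic vertex buffers; alternate between them each time the chunk's
    // geometry is rebuilt, so the GPU can still be reading the old one while
    // the CPU fills the new one.
    ID3D11Buffer* g_vb[2] = { nullptr, nullptr };
    int g_writeIndex = 0;

    void rebuildChunkVB(ID3D11DeviceContext* ctx, const Vertex* vertices, UINT vertexCount)
    {
        ID3D11Buffer* target = g_vb[g_writeIndex];

        D3D11_MAPPED_SUBRESOURCE mapped;
        // DISCARD: the whole buffer is being replaced, so the driver can hand
        // back fresh memory instead of waiting for the GPU to finish with it.
        ctx->Map(target, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
        std::memcpy(mapped.pData, vertices, vertexCount * sizeof(Vertex));
        ctx->Unmap(target, 0);

        UINT stride = sizeof(Vertex), offset = 0;
        ctx->IASetVertexBuffers(0, 1, &target, &stride, &offset);

        g_writeIndex = 1 - g_writeIndex;     // next rebuild targets the other buffer
    }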

As for the kernel calls, I will take a look. A 50-second xperf run only had a few samples in it; not sure exactly how those work and how xperf would interact, though. Actually, on inspection it almost looks like ETW/xperf detecting itself:


   |     |     |     |     |     |     |- ntkrnlmp.exe!KiInterruptDispatchNoLock (2 samples in 50 seconds)
   |     |     |     |     |     |     |- ntkrnlmp.exe!KiDpcInterrupt (1 sample in 50 seconds)
   |     |     |     |     |     |     |     ntkrnlmp.exe!KxDispatchInterrupt
   |     |     |     |     |     |     |     ntkrnlmp.exe!SwapContext_PatchXRstor
   |     |     |     |     |     |     |     ntkrnlmp.exe!EtwTraceContextSwap

The Task Manager kernel line seems to fluctuate between 1/8 and 1/6 of the load (approx 30% total, approx 3.75% to 5% kernel), although it's hard to know how much of that is my doing. Without my app running, all the other stuff looks like 1 to 5%.


I forgot one suggestion: turn on "one graph per CPU" also. Due to affinity, your application will likely run on a single core, and nothing else will be dispatched to that core while your game is running (for the most part). So the normal 1-4% idle kernel work will by nature be done on the other cores, and the core with your game on it will give a pretty clean view. 1/8th-1/6th is bad; it really should be in the 2-3% at most range if you are not getting blocked by drivers and such.

I was just using "xperf -on latency -stackwalk profile" for the profiles; I recall there are a bunch of other options, but I can't really remember them, I've just been using that from a .bat I made like a year ago. Might go look through the list again.


I've never used xperf, unfortunately; I always have companies buy me VTune because, while not the easiest thing in the world to get working properly, it is generally the best once it does work.

400 FPS with around 370 chunks (it renders a circle of chunks up to 384 units away) with a broken/disabled frustum cull (a separate issue I need to go think about, since my code only actually tested the AABB corner points, and the AABB for a chunk is large enough that this gives false negatives). But yes, I think that hits the CPU limit there rather than the GPU limit (AMD Radeon HD 7870), but I suspect once I have some graphics that include more than solid pixel rendering with the simplest-ever shader and a few 16x16 textures, that gap will close.
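(For reference, the kind of test I probably need for the cull is something like this sketch: test each frustum plane against the box corner furthest along the plane normal rather than the individual corners. The Plane layout is invented for the example and normals are assumed to point into the frustum.)

    struct Plane { float nx, ny, nz, d; };   // nx*x + ny*y + nz*z + d >= 0 means "inside"

    bool aabbOutsideFrustum(const Plane planes[6],
                            float minX, float minY, float minZ,
                            float maxX, float maxY, float maxZ)
    {
        for (int i = 0; i < 6; ++i)
        {
            const Plane& p = planes[i];
            // Corner of the box furthest in the direction of the plane normal
            // (the "positive vertex"); avoids false negatives from corner tests.
            const float px = (p.nx >= 0.0f) ? maxX : minX;
            const float py = (p.ny >= 0.0f) ? maxY : minY;
            const float pz = (p.nz >= 0.0f) ? maxZ : minZ;
            if (p.nx * px + p.ny * py + p.nz * pz + p.d < 0.0f)
                return true;    // whole box is behind this plane -> cull it
        }
        return false;           // not trivially outside (conservatively visible)
    }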

For VBs, isn't that what the no-CPU-access/discard/no-overwrite vertex buffer access flags are for? Or even creating a new vertex buffer from scratch (still playing with the best way to handle that, since obviously I can't just make one with enough space for every vertex that might ever exist in the chunk). I'm not even sure how I am meant to know the GPU is truly finished, since Present doesn't mean the GPU is finished? Practically, I guess the worst case is I might change a given vertex buffer once every few frames, and others nearly never. It's the constant buffers that seem to get changed all the time, to render models and such with different transforms.

I did consider VTune, but that would also require an Intel CPU, which I have been considering since before Ivy Bridge came out; I've been hoping their next gen might actually boost CPU performance to a point where I care for the cost, and each time around I seem to get offered the same cost plus inflation, with a faster integrated GPU...

Or, ya know, you could break out some tools to properly profile things instead of doing all this guesswork?

Simply looking at the attached CPU trace shows the CPUs are largely idle during that run, which means you still have resources to spare.

AMD have tools to profile both the CPU and GPU which you can get for free; use them to find out WHAT is going on instead of vaguely waving your hands around and going 'well, it's probably this...' - that isn't programming, that's voodoo.

Or, ya know, you could break out some tools to properly profile things instead of doing all this guesswork?


No offense intended, but I believe, given the information in the thread, there is pretty decent profiling being performed. I'd ignore the thread if I thought it was all ad hoc "something be bad". The current replies suggest a pretty decent amount of profiling and testing, which to me suggests an interesting problem.

400 FPS with around 370 chunks (it renders a circle of chunks up to 384 units away) with a broken/disabled frustum cull.....


Ah, BIG difference in my thinking then. I thought your numbers were for a single 32x32x300 chunk; I guess I missed that it was in a multiple-chunk context. I need to reconsider some of my initial concepts to a degree. Though it actually suggests that the API is the blocking item and not your code, which suggests a higher likelihood of bad API usage. Though in such a case it is likely a "minor" item which simply adds up.

Sorry if I'm annoying you by picking at things, but it is usually the "picky stuff" that is the problem. If you are rendering that many chunks, then the "little" things add up. A single fence implied by modifying a vertex buffer could be the cause of the problems. You seem to be using vertex buffers well, though, so I won't repeat those issues, but there are plenty of other easy-to-ignore yet highly performance-critical variations in data access "possible" here; I'm just poking at the most common ones given the current information.

For VBs, isn't that what the no-CPU-access/discard/no-overwrite vertex buffer access flags are for?


Yip, very much so. But... the drivers don't always work the same way when you modify the same buffer 2 or 3 times per frame. It *SHOULD*, for all intents and purposes, behave the same no matter the driver, but... they don't, sorry. A certain driver uses a "best case" optimization assuming that the buffer will change once per frame at most. As such, any full-buffer locks instead of sectional locks will quickly start blocking, because the driver can no longer rely on that implied "user" behavior.

This is the unfortunate problem of a "thin" layer over the drivers: they are all supposed to behave the same way, but they don't. Outside of the API specifics, things are kinda random when performance items are involved.
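To make that concrete, the pattern that tends to keep drivers happy is at most one full-buffer discard per frame and sectional no-overwrite writes after that, something like this sketch (placeholder Vertex type, error handling omitted):

    #include <d3d11.h>
    #include <cstring>

    struct Vertex { float x, y, z; };        // placeholder layout for the sketch

    UINT g_cursor = 0;                       // vertices already written this frame
    bool g_firstWriteThisFrame = true;       // reset both at the start of each frame

    // Append 'count' vertices to a dynamic VB: one WRITE_DISCARD per buffer per
    // frame, then WRITE_NO_OVERWRITE appends, so the driver never has to stall
    // on the GPU reading earlier parts of the buffer.
    void appendVertices(ID3D11DeviceContext* ctx, ID3D11Buffer* vb,
                        const Vertex* src, UINT count)
    {
        const D3D11_MAP mapType = g_firstWriteThisFrame ? D3D11_MAP_WRITE_DISCARD
                                                        : D3D11_MAP_WRITE_NO_OVERWRITE;
        D3D11_MAPPED_SUBRESOURCE mapped;
        ctx->Map(vb, 0, mapType, 0, &mapped);
        std::memcpy(static_cast<Vertex*>(mapped.pData) + g_cursor, src, count * sizeof(Vertex));
        ctx->Unmap(vb, 0);

        g_firstWriteThisFrame = false;
        // draw using vertices [g_cursor, g_cursor + count) here...
        g_cursor += count;
    }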

Or even creating a new vertex buffer from scratch (still playing with the best way to handle that, since obviously I can't just make one with enough space for every vertex that might ever exist in the chunk). I'm not even sure how I am meant to know the GPU is truly finished, since Present doesn't mean the GPU is finished? Practically, I guess the worst case is I might change a given vertex buffer once every few frames, and others nearly never. It's the constant buffers that seem to get changed all the time, to render models and such with different transforms.


Yeah, sucks, don't it? Sorry, I have absolutely no personal feelings on this; been there, done that, and I figure you have or will learn all the nightmares eventually. I can say "don't do x, y or z" all day and you will likely still have parts doing those things. My suggestions and experience just won't really make sense till you actually run into some of the annoying issues. All I can really do is explain why you stepped on yer member and how to fix it; wish I could really say "don't do X", but there is no clear way to describe X to start with.

