Multithreading a game's logic


No offense intended, but I believe that given the information in the thread some fairly decent profiling is being performed. I'd ignore the thread if I thought it was all ad hoc "something be bad" guessing. The replies so far suggest a decent amount of profiling and testing, which to me suggests an interesting problem.


Comments along the lines of 'I think it hits the CPU limit, not the GPU limit' would seem to indicate a degree of guesswork is still going on, more so when combined with a CPU trace which shows mostly idle usage across all the cores.

The fact the driver shows up in the trace could indicate that the OP is completing CPU frames at a faster rate than the GPU is clearing them, and is thus being stalled by the driver blocking while resources are cleared. (It could also be a case of too many resources being discarded per frame, putting the driver under too much pressure as it has limited memory to perform buffer updates in this manner.)

Also, the OP mentions not using VTune as he lacks an Intel CPU, but AMD also has profiling tools which could help track down the problem further (GPU: GPU PerfStudio, CPU: CodeXL).

The point is, I object to any case of 'I think this...' without supporting evidence - that is voodoo.

Comments along the lines of 'I think it hits the CPU limit, not the GPU limit' would seem to indicate a degree of guesswork is still going on, more so when combined with a CPU trace which shows mostly idle usage across all the cores.


Yes, guesswork is going on. Prior to having thousands of dollars of software to speed up my investigations, I used to have to guess, A LOT! Given free-ish tools, lots of guessing is the best that most folks can do. I intend to encourage "GOOD" guessing, as I can't use all my software on their problems and point things out specifically. Learning to guess "WELL" is possibly the best ability any software engineer can learn. So I question why you have an issue with this?


Encouraging "good guessing" and suggesting free methods that at least lead towards proper results seems much more effective than being an ass. Sorry to be an ass and call you an ass, but this particular item is a hot button for me. Folks doing the best they can (and the details given were quite thorough) getting such a negative reply pisses me off. I will never be that negative, and I will encourage someone who is obviously doing what they can and is just missing those details you and I know about. It's called learning; you don't have to be an ass about it.

Well, I managed to get an implementation of that solid-buffer idea working (not currently bit-packed, just a bool array). I'll take a more detailed look in the morning to see if I can optimise it, but I had the profiler running during some normal gameplay, so it was not loaded too heavily with newly loaded chunks or changes to existing ones.

The problem of logic steps that recreate buffers is still there: around 20 ms per re-cached chunk, which can push some update steps into the 60-70 ms range.


|- ChunkRenderer::buildCache                            3.24
|     |- SimpleBlockRenderer::cache                     2.57
//The solid checks appear to have been fully inlined, but still take most of the time
|     |     |- SimpleBlockRenderer::cache<itself>       2.05
|     |     |- SimpleBlockRenderer::cacheTop            0.16
|     |     |- SimpleBlockRenderer::cacheNorth          0.08
|     |     |- SimpleBlockRenderer::cacheEast           0.07
|     |     |- SimpleBlockRenderer::cacheWest           0.07
|     |     |- SimpleBlockRenderer::cacheBottom         0.07
|     |     |- SimpleBlockRenderer::cacheSouth          0.06
|     |     |- ntkrnlmp.exe!KiInterruptDispatchNoLock   0.00
|     |     |- ntkrnlmp.exe!KiDpcInterrupt              0.00
//Looping over the chunks blocks, calling the correct renderer for the block type
//(only SimpleBlockRenderer exists currently)
|     |- ChunkRenderer::buildCache<itself>              0.27
//Largely got inlined. This function actually fills out a layer of the buffer using the world data
|     |- SolidBlockBuffer::fillLayer                    0.22
//This basically takes the vertex/index buffer for each texture and makes a single pair of buffers
//with a list of start/count integers for each texture, ready for the render thread to take
|     |- ChunkStaticRenderer::buildCache                0.08


void SimpleBlockRenderer::cache(
    WorldChunk &chunk,
    ChunkRenderer &chunkRenderer,
    const SolidBlockBuffer &solidBlockCache,
    const Block *block,
    BlockId bid,
    //TODO: x,z could be coordinates within the chunk, rather than the world
    int x,
    int y,
    int z)
{
    //Emit a face only when the neighbouring block in that direction is not solid
    if (!solidBlockCache.isSolid(chunk, x, y + 1, z))
        cacheTop(chunk, chunkRenderer, block, bid, x, y, z);
    if (!solidBlockCache.isSolid(chunk, x, y - 1, z))
        cacheBottom(chunk, chunkRenderer, block, bid, x, y, z);

    if (!solidBlockCache.isSolid(chunk, x + 1, y, z))
        cacheEast(chunk, chunkRenderer, block, bid, x, y, z);
    if (!solidBlockCache.isSolid(chunk, x - 1, y, z))
        cacheWest(chunk, chunkRenderer, block, bid, x, y, z);

    if (!solidBlockCache.isSolid(chunk, x, y, z + 1))
        cacheNorth(chunk, chunkRenderer, block, bid, x, y, z);
    if (!solidBlockCache.isSolid(chunk, x, y, z - 1))
        cacheSouth(chunk, chunkRenderer, block, bid, x, y, z);
}
bool SolidBlockBuffer::isSolid(const WorldChunk &chunk, int x, int y, int z)const
{
    //TODO: This messing around to get bufferIndex can be avoided,
    //e.g. could just keep the bufferIndex for y - 1, y and y + 1 rather
    //than just y, and could have 3 variations of isSolid rather than take
    //the y param.
    //TODO: Can also avoid this -= on x, z for every check
    x -= chunk.getBlockX();
    z -= chunk.getBlockZ();
    assert(x >= -1 && x <= World::CHUNK_SIZE);
    assert(z >= -1 && z <= World::CHUNK_SIZE);

    assert(y >= currentY - 1 && y <= currentY + 1);
    int d = y - currentY;
    assert(d >= -1 && d <= 1);

    //buffer is a ring of 3 layers (y - 1, y, y + 1); wrap d = -1 to 2 so the
    //modulo picks the correct ring slot
    if (d == -1) d += 3;
    int bufferIndex = (currentBuffer + d) % 3;
    //shift by 1 to account for the one-block border around the chunk
    x += 1;
    z += 1;
    return buffer[bufferIndex][x * SIZE + z];
}

Going to take a look at the AMD tool as well. It would be really nice to see what's going on within inlined functions; manually taking the xperf instruction addresses and comparing them to the VS disassembly panel is a pain, but I've only ever seen anything about that done with VTune and an ICC-compiled module.

EDIT:

Another idea I suppose is to spread the work out, since not every update step modifies multiple vertex buffers. But that can either result in a never-ending task to reach an "everything is done" state if enough changes are happening, or errors where adjacent chunks don't line up fully because one was not updated. I could also reduce the chunk size at the cost of making some other things less efficient: more vertex/index buffers doing smaller draws, more expensive cross-chunk block accesses, and more difficulty with large map features that are not purely noise based (e.g. if a tree or building could span its own chunk, an adjacent chunk, and one beyond that). But I guess it might be worth it.

Will give the threading idea a go as well just to see; it should be simple enough for this chunk vertex buffer issue, since building each buffer only reads the world data, never changes it, and only writes to its own buffer, no one else's. But I'm going to see if AMD's tool can tell me whether I am memory-bandwidth limited during that operation first, since if I am, the only thing I can think of to really do is improve CPU cache usage to make that code more efficient.
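
A minimal sketch of what that threading could look like, assuming buildCache is safe to call concurrently (each task only reads shared world data and writes its own chunk's buffer); the rebuildDirtyChunks helper and the exact buildCache signature are my guesses, not the actual code:

#include <future>
#include <vector>

//Hypothetical driver for the idea above: one task per dirty chunk, joined
//before the results are handed to the render thread.
void rebuildDirtyChunks(ChunkRenderer &renderer, std::vector<WorldChunk*> &dirty)
{
    std::vector<std::future<void>> jobs;
    jobs.reserve(dirty.size());
    for (WorldChunk *chunk : dirty)
    {
        //Safe without locks: each task reads world data and writes only to
        //its own chunk's vertex buffer.
        jobs.push_back(std::async(std::launch::async,
            [&renderer, chunk] { renderer.buildCache(*chunk); }));
    }
    for (auto &job : jobs)
        job.wait(); //all caches built before the logic step continues
}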

Also, while yes, I am sure there are further things I can do to use the GPU better and hit 1000 fps, right now the only thing that ever causes noticeable lag spikes is various things on the logic thread, which at present is basically:

- Updating chunks (including some initial stuff like planting the first trees) / entities

- Calculating lighting (I am sure this is going to be a pain later, but for now I ignored chunks and pretty much everything else here and did a "canSeeSky()" lighting system; see the sketch after this list)

- Creating those vertex buffers if something changed
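
To make the lighting item concrete, here is a guess at what a "canSeeSky()" check could look like; the isSolid query and CHUNK_HEIGHT constant are assumptions, not the actual code:

//Hypothetical sky-visibility test matching the description above: a block
//is sky-lit if no solid block sits anywhere above it in its column.
bool canSeeSky(const WorldChunk &chunk, int x, int y, int z)
{
    for (int above = y + 1; above < World::CHUNK_HEIGHT; ++above)
        if (chunk.isSolid(x, above, z)) //assumed per-chunk solidity query
            return false;
    return true;
}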

Rendering, audio and load/save (well, the compression and disk IO part) already have their own threads, so hopefully they don't impact this 20 Hz logic problem.

Finished off with that code: 7 to 10 ms per chunk, over 2x faster than the original code :D

Still a slight problem that if several things happen at once it gets overloaded and has a 100 ms update step, but I think now I just need to work on ensuring certain things happen on different frames. Perhaps have the concept of a critical vertex rebuild (e.g. a block was removed on the chunk border, so not rebuilding both vertex buffers can cause a glitch with the player seeing through a gap, which caused a performance problem originally anyway) and ones that can be done on any step with spare time (changes in the middle of chunks, updated lighting, change of texture, etc.).
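
A sketch of one way that critical/deferrable split could be arranged; all names are illustrative, not the actual code:

#include <chrono>
#include <deque>

//Two-queue scheme: critical rebuilds (border changes that would leave
//visible gaps) always run; deferrable ones drain only while the update
//step has time budget left, spreading the load over several frames.
struct RebuildQueues
{
    std::deque<WorldChunk*> critical;
    std::deque<WorldChunk*> deferrable;

    void drain(ChunkRenderer &renderer, std::chrono::milliseconds budget)
    {
        using clock = std::chrono::steady_clock;
        const auto start = clock::now();

        while (!critical.empty()) //always done, even if it blows the budget
        {
            renderer.buildCache(*critical.front());
            critical.pop_front();
        }
        while (!deferrable.empty() && clock::now() - start < budget)
        {
            renderer.buildCache(*deferrable.front());
            deferrable.pop_front();
        }
    }
};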

EDIT:

Just a quick one on the buffer thing. While I know I can improve various aspects of my vertex and index buffer usage, what about the constant buffers?

E.g. I have one that basically contains a 4x4 float matrix my shaders use for transforms, and at present I map/write/unmap it a tonne of times per frame (e.g. for rendering entities, although I believe when I get round to it those should be using instancing anyway these days?).

Does the driver do something special for these less-than-a-few-hundred-bytes constant buffers (compared to large vertex buffers), or was the actual intention that I have, say, 50 copies of the buffer and use each once per frame? MSDN seems to basically just say don't put data in a bound cbuffer the shader does not use, and don't update parts of a cbuffer you don't have to (split it into a frequently and an infrequently updated cbuffer), but says nothing about whether one should reuse the same ID3D11Buffer for every update (something like 500 per frame in their example) or not.
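
For reference, a minimal sketch of the pattern in question, assuming a cbuffer created with D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE (the helper name is made up):

#include <d3d11.h>
#include <DirectXMath.h>
#include <cstring>

//The map/write/unmap-per-draw pattern being asked about: one small dynamic
//cbuffer, re-filled with WRITE_DISCARD before each entity draw.
void updateTransformCB(ID3D11DeviceContext *context, ID3D11Buffer *transformCB,
                       const DirectX::XMFLOAT4X4 &worldViewProj)
{
    D3D11_MAPPED_SUBRESOURCE mapped;
    if (SUCCEEDED(context->Map(transformCB, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        std::memcpy(mapped.pData, &worldViewProj, sizeof(worldViewProj));
        context->Unmap(transformCB, 0);
    }
    context->VSSetConstantBuffers(0, 1, &transformCB);
}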

Sorry for the delay, went out of town again...

Finished off with that code: 7 to 10 ms per chunk, over 2x faster than the original code :D

Still a slight problem that if several things happen at once it gets overloaded and has a 100 ms update step, but I think now I just need to work on ensuring certain things happen on different frames. Perhaps have the concept of a critical vertex rebuild (e.g. a block was removed on the chunk border, so not rebuilding both vertex buffers can cause a glitch with the player seeing through a gap, which caused a performance problem originally anyway) and ones that can be done on any step with spare time (changes in the middle of chunks, updated lighting, change of texture, etc.).


The speed difference between bits and bools is likely to be fairly trivial, depending on your CPU cache size. With small caches you might see a bit of a speed-up; with large caches I doubt any change. I'm a bit annoyed it was not closer to the 3x speed-up I was thinking it would be though; that suggests there is still another problem in there we have not found, or your order of memory access is still not linear. Though at this point it could easily be limited purely by memory access rates on your CPU, so it's maxed out now. (Seems unlikely though for such a small thing, relatively speaking.)
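
For what it's worth, a sketch of the bit-packed variant: SIZE = 34 is inferred from the 3468-byte figure given below (3 layers of 34x34 bools, i.e. CHUNK_SIZE 32 plus a one-block border); the struct name is made up:

#include <cstdint>

//Bit-packed version of one solid-flag layer: 34x34 bits (~145 bytes)
//instead of 34x34 bools (1156 bytes), at the cost of a shift/mask per query.
struct PackedSolidLayer
{
    static const int SIZE = 34; //assumed: CHUNK_SIZE (32) plus border columns

    uint32_t bits[(SIZE * SIZE + 31) / 32];

    void set(int x, int z, bool solid)
    {
        const int i = x * SIZE + z;
        if (solid) bits[i >> 5] |=  1u << (i & 31);
        else       bits[i >> 5] &= ~(1u << (i & 31));
    }

    bool get(int x, int z) const
    {
        const int i = x * SIZE + z;
        return (bits[i >> 5] >> (i & 31)) & 1u;
    }
};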

EDIT:
Just a quick one on the buffer thing. While I know I can improve various aspects of my vertex and index buffer usage, what about the constant buffers?
E.g. I have one that basically contains a 4x4 float matrix my shaders use for transforms, and at present I map/write/unmap it a tonne of times per frame (e.g. for rendering entities, although I believe when I get round to it those should be using instancing anyway these days?).

Does the driver do something special for these less-than-a-few-hundred-bytes constant buffers (compared to large vertex buffers), or was the actual intention that I have, say, 50 copies of the buffer and use each once per frame? MSDN seems to basically just say don't put data in a bound cbuffer the shader does not use, and don't update parts of a cbuffer you don't have to (split it into a frequently and an infrequently updated cbuffer), but says nothing about whether one should reuse the same ID3D11Buffer for every update (something like 500 per frame in their example) or not.


When it comes to constant buffers there is a problem, as with other larger items. If you make a change, send a draw call and make another change, "most" drivers I believe do properly record only the changes and send them as part of the draw calls. Unfortunately, though, the smallest DMA request for most cards is I believe 32 bytes, and that is a single command in the driver-side buffer, which is fixed length. So the driver needs to allocate a command to issue and also a 32-byte memory transfer buffer which (though not as likely) can/will run the driver side out of memory. Drivers have to be pretty simple-minded to work in kernel mode, and as such, lots of little changes (which map to 32-byte DMA transfers, or whatever the minimal size is now) will give them fits of bad performance like anything else.

Of course, all said and done, unless you modify these things every call and have thousands of calls, they will not add up to a single texture update, and as such are unlikely to be a problem. I only point out the details as a "watch it" comment. :)

My 1055T has 128KB L1, 512KB L2 and 6MB L3. I've not really researched what other processors have these days, but the SolidBlockBuffer data array is a solid 3468 bytes, so I assume the cache should be able to hold it in its entirety (since even though I iterate in the z direction, which is adjacent access, I still do the +-x and +-y jumps as well).

There are some other accesses of other data that I think I can rework to avoid in the average "nothing to render here" case.

Going to dig through the CodeXL docs to see if there is anything useful there in addition to the plain sample profile data I have. I guess it might at least give the number of cache misses in a region of code, so I can see if it's excessive.

When it comes to constant buffers there is a problem, as with other larger items. If you make a change, send a draw call and make another change, "most" drivers I believe do properly record only the changes and send them as part of the draw calls. Unfortunately, though, the smallest DMA request for most cards is I believe 32 bytes, and that is a single command in the driver-side buffer, which is fixed length. So the driver needs to allocate a command to issue and also a 32-byte memory transfer buffer which (though not as likely) can/will run the driver side out of memory. Drivers have to be pretty simple-minded to work in kernel mode, and as such, lots of little changes (which map to 32-byte DMA transfers, or whatever the minimal size is now) will give them fits of bad performance like anything else.


This of course depends on how you update the buffers; if you are doing a 'discard' then the driver will likely assign new memory to your buffer instead of waiting on any update contention GPU-side. This means plenty of book-keeping AND internal driver limits (the AMD GCN performance tweets give an 8MB 'rename' buffer with current DX11 drivers; in other words, it can 'rename' the buffer for up to 8MB of data before it has to wait for previous data to become free to reuse, thus causing stalling).

This might not be a problem for your constant blocks, as they are likely so small that multiple discards per frame aren't hurting the rename buffer (although they are probably doing a number on the book-keeping side of things; see below), but this does apply to any and ALL buffers created as 'dynamic'.

A sequence like [discard-update]-[draw]-[discard-update]-[draw]-[discard-update]-[draw]-[discard-update]-[draw] is likely to be hell for this, as all 4 of those draws are likely to be in flight at the same time, which means the driver has had to deal with 4 renames.
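
One common mitigation, sketched under the assumption of D3D11 (where MAP_WRITE_NO_OVERWRITE is allowed on dynamic vertex/index buffers, though not on constant buffers before 11.1): discard once per wrap and append with NO_OVERWRITE in between, so most draws cause no rename at all. Names are illustrative.

#include <d3d11.h>
#include <cstring>

//Ring-buffer style updates for a dynamic vertex buffer: NO_OVERWRITE
//appends never force a rename; a DISCARD happens only when we wrap.
void appendToDynamicVB(ID3D11DeviceContext *ctx, ID3D11Buffer *vb, UINT capacity,
                       UINT &writeOffset, const void *data, UINT bytes)
{
    D3D11_MAP mode = D3D11_MAP_WRITE_NO_OVERWRITE;
    if (writeOffset + bytes > capacity)
    {
        writeOffset = 0;                //wrap around
        mode = D3D11_MAP_WRITE_DISCARD; //one rename per wrap, not per draw
    }
    D3D11_MAPPED_SUBRESOURCE mapped;
    if (SUCCEEDED(ctx->Map(vb, 0, mode, 0, &mapped)))
    {
        std::memcpy(static_cast<char*>(mapped.pData) + writeOffset, data, bytes);
        ctx->Unmap(vb, 0);
        writeOffset += bytes;
    }
}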

