

Member Since 06 Feb 2011
Offline Last Active Dec 02 2013 07:09 AM

Posts I've Made

In Topic: Multithreading a games logic

25 August 2013 - 04:59 AM

My 1055T has 128KB L1 and 512KB L2 per core, and 6MB of shared L3. I haven't really researched what other processors have these days, but the SolidBlockBuffer data array is a solid 3468 bytes, so I assume the cache should be able to hold it in its entirety (even though I iterate in the z direction, which gives adjacent accesses, I still do the +- x and y jumps as well).
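As a sanity check on that 3468-byte figure: with a one-cell border on each side and three rolling y layers (as in the isSolid code further down), a 1-byte bool array works out to exactly that, assuming a chunk size of 32. The 32 here is my inference from the numbers, not something stated explicitly:

```cpp
#include <cassert>

// Hypothetical reconstruction of the SolidBlockBuffer footprint; the
// chunk size of 32 is inferred from the 3468-byte figure, not stated.
constexpr int CHUNK_SIZE = 32;
constexpr int SIZE = CHUNK_SIZE + 2;   // one cell of border on each side
constexpr int LAYERS = 3;              // rolling y-1, y, y+1 layers

constexpr int solidBufferBytes()
{
    return LAYERS * SIZE * SIZE * static_cast<int>(sizeof(bool));
}
// 3 * 34 * 34 * 1 = 3468 bytes, which fits comfortably in a 64KB L1 data cache.
```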


There are some other accesses of other data that I think I can rework to avoid in the average "nothing to render here" case.


Going to dig through the CodeXL docs to see if there is anything useful there in addition to the plain sample profile data I have. I guess it might at least give the number of cache misses in a region of code, so I can see if it's excessive.

In Topic: Multithreading a games logic

23 August 2013 - 02:41 AM

Finished off with that code: 7 to 10ms per chunk, over 2x faster than the original code.


There is still a slight problem that if several things happen at once it gets overloaded and has a 100ms update step, but I think now I just need to work on ensuring certain things happen on different frames. Perhaps have the concept of a critical vertex rebuild (e.g. a block was removed on the chunk border, so not rebuilding both vertex buffers would cause a glitch with the player seeing through a gap, which caused a performance problem anyway originally) versus rebuilds that can be done on any step with spare time (changes in the middle of chunks, updated lighting, a change of texture, etc.).



Just a quick one on the buffer thing. While I know I can improve various aspects of my vertex and index buffer usage, what about the constant buffers?

E.g. I have one that basically contains a 4x4 float matrix my shaders use for transforms, and at present I map/write/unmap it a tonne of times per frame (e.g. for rendering entities, although I believe when I get round to it those should be using instancing anyway these days)?


Does the driver do something special for these constant buffers of less than a few hundred bytes (compared to large vertex buffers), or was the actual intention that I have, say, 50 copies of the buffer and use each once per frame? MSDN seems to basically just say don't put data in a bound cbuffer the shader does not use, and don't update parts of a cbuffer you don't have to (split it into frequently and infrequently updated cbuffers), but says nothing about whether one should reuse the same ID3D11Buffer for every update (something like 500 per frame in their example) or not.
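For what it's worth, the multiple-copies idea can be sketched independently of D3D as a simple ring: a pool of N slots cycled per update, so an update never touches a slot a consumer (here, the GPU) might still be reading. The class and names below are illustrative, not from any API; in D3D11 each slot would be its own ID3D11Buffer mapped with D3D11_MAP_WRITE_DISCARD:

```cpp
#include <array>
#include <cstddef>

// Illustrative ring of constant-buffer slots. In D3D11 each slot would
// be a separate ID3D11Buffer; with D3D11_MAP_WRITE_DISCARD the driver
// effectively does this renaming internally on one buffer.
template <typename T, std::size_t N>
class BufferRing
{
    std::array<T, N> slots{};
    std::size_t next = 0;
public:
    // Write the new contents into the next free slot and return the
    // slot index to bind for this draw.
    std::size_t update(const T &value)
    {
        std::size_t i = next;
        slots[i] = value;
        next = (next + 1) % N;   // older slots may still be in flight
        return i;
    }
    const T &get(std::size_t i) const { return slots[i]; }
};
```

My understanding is that discard-map renaming is exactly why reusing one small ID3D11Buffer per draw is generally considered acceptable, but the ring makes the cost model explicit if the driver turns out not to handle it well.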

In Topic: Multithreading a games logic

22 August 2013 - 12:19 PM

Well, I managed to get an implementation of that solid buffer idea working (not currently bit packed, just a bool array). Will take a more detailed look in the morning to see if I can optimise it, but with the profiler running during some normal gameplay I got some info, so it was not loaded too heavily with newly loaded chunks or changes to existing ones.


The problem of logic steps that recreate buffers is still there, with around 20ms per chunk that was re-cached, which can push some update steps into the 60 or 70ms range.

|- ChunkRenderer::buildCache                            3.24
|     |- SimpleBlockRenderer::cache                     2.57
//The solid checks appear to have been fully inlined, but still take most of the time
|     |     |- SimpleBlockRenderer::cache<itself>       2.05
|     |     |- SimpleBlockRenderer::cacheTop            0.16
|     |     |- SimpleBlockRenderer::cacheNorth          0.08
|     |     |- SimpleBlockRenderer::cacheEast           0.07
|     |     |- SimpleBlockRenderer::cacheWest           0.07
|     |     |- SimpleBlockRenderer::cacheBottom         0.07
|     |     |- SimpleBlockRenderer::cacheSouth          0.06
|     |     |- ntkrnlmp.exe!KiInterruptDispatchNoLock   0.00
|     |     |- ntkrnlmp.exe!KiDpcInterrupt              0.00
//Looping over the chunks blocks, calling the correct renderer for the block type
//(only SimpleBlockRenderer exists currently)
|     |- ChunkRenderer::buildCache<itself>              0.27
//Largely got inlined. This function actually fills out a layer of the buffer using the world data
|     |- SolidBlockBuffer::fillLayer                    0.22
//This basically takes the vertex/index buffer for each texture and makes a single pair of buffers
//with a list of start/count integers for each texture, ready for the render thread to take
|     |- ChunkStaticRenderer::buildCache                0.08

void SimpleBlockRenderer::cache(
    WorldChunk &chunk,
    ChunkRenderer &chunkRenderer,
    const SolidBlockBuffer &solidBlockCache,
    const Block *block,
    BlockId bid,
    //TODO: x,z could be coordinates within the chunk, rather than the world
    int x,
    int y,
    int z)
{
    if (!solidBlockCache.isSolid(chunk, x, y + 1, z))
        cacheTop(chunk, chunkRenderer, block, bid, x, y, z);
    if (!solidBlockCache.isSolid(chunk, x, y - 1, z))
        cacheBottom(chunk, chunkRenderer, block, bid, x, y, z);

    if (!solidBlockCache.isSolid(chunk, x + 1, y, z))
        cacheEast(chunk, chunkRenderer, block, bid, x, y, z);
    if (!solidBlockCache.isSolid(chunk, x - 1, y, z))
        cacheWest(chunk, chunkRenderer, block, bid, x, y, z);

    if (!solidBlockCache.isSolid(chunk, x, y, z + 1))
        cacheNorth(chunk, chunkRenderer, block, bid, x, y, z);
    if (!solidBlockCache.isSolid(chunk, x, y, z - 1))
        cacheSouth(chunk, chunkRenderer, block, bid, x, y, z);
}

bool SolidBlockBuffer::isSolid(const WorldChunk &chunk, int x, int y, int z) const
{
    //TODO: This messing around to get bufferIndex can be avoided,
    //e.g. could just keep the bufferIndex for y - 1, y and y + 1 rather
    //than just y, and could have 3 variations of isSolid rather than take
    //the y param.
    //TODO: Can also avoid this -= on x, z for every check
    x -= chunk.getBlockX();
    z -= chunk.getBlockZ();
    assert(x >= -1 && x <= World::CHUNK_SIZE);
    assert(z >= -1 && z <= World::CHUNK_SIZE);

    assert(y >= currentY - 1 && y <= currentY + 1);
    int d = y - currentY;
    assert(d >= -1 && d <= 1);

    if (d == -1) d += 3;
    int bufferIndex = (currentBuffer + d) % 3;
    x += 1;
    z += 1;
    return buffer[bufferIndex][x * SIZE + z];
}
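Since the buffer is currently a plain bool array, a bit-packed variant of the same rolling three-layer structure might look like the sketch below. It is standalone with made-up names, not the actual class, and it drops the chunk-relative offset handling for brevity (coordinates are already chunk-relative here):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical bit-packed version of the rolling 3-layer solid buffer.
// Coordinates are chunk-relative, in [-1, SIZE - 2] including the border.
class PackedSolidLayers
{
    static constexpr int SIZE = 34;                  // CHUNK_SIZE + 2 border cells
    static constexpr int WORDS = (SIZE * SIZE + 31) / 32;
    std::vector<uint32_t> bits;                      // 3 layers of packed bits
public:
    PackedSolidLayers() : bits(3 * WORDS, 0) {}

    void set(int layer, int x, int z, bool solid)
    {
        int i = (x + 1) * SIZE + (z + 1);
        uint32_t mask = 1u << (i % 32);
        uint32_t &word = bits[layer * WORDS + i / 32];
        if (solid) word |= mask; else word &= ~mask;
    }

    bool isSolid(int layer, int x, int z) const
    {
        int i = (x + 1) * SIZE + (z + 1);
        return (bits[layer * WORDS + i / 32] >> (i % 32)) & 1u;
    }
};
```

This shrinks the working set from 3468 bytes to about 444 at the cost of a shift and mask per test; whether that wins depends on whether the bool version is actually missing cache in the first place.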

Going to take a look at the AMD tool as well. It would be really nice to see what's going on within inlined functions; manually taking the xperf instruction addresses and comparing them to the VS disassembly panel is a pain, but I have only ever seen anything about that with VTune and an ICC-compiled module.



Another idea I suppose is to spread the work out, since not every update step is modifying multiple vertex buffers. But that can either result in a never-ending task to get to an "everything is done" state if enough changes are happening, or errors where adjacent chunks don't line up fully because one was not updated. Could also reduce the chunk size at the cost of making some other things less efficient: more vertex/index buffers doing smaller draws, more expensive cross-chunk block accesses, and more difficulty with large map features that are not purely noise based (e.g. if a tree or building could span its own chunk, an adjacent chunk, and one beyond that). But I guess it might be worth it.
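The spread-the-work-out idea can be sketched as a simple per-step budget over a queue of rebuild jobs. Everything here is illustrative (the class, the abstract cost units); a real version would use measured time and run critical border rebuilds first:

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Illustrative per-step budget: run queued rebuild jobs until this logic
// step's budget is spent, leaving the remainder for later steps.
class RebuildQueue
{
    std::deque<std::pair<int, std::function<void()>>> jobs; // {cost, work}
public:
    void push(int cost, std::function<void()> work)
    {
        jobs.emplace_back(cost, std::move(work));
    }

    // Runs jobs in order while they fit in the budget; returns how many ran.
    int runStep(int budget)
    {
        int ran = 0;
        while (!jobs.empty() && budget >= jobs.front().first)
        {
            budget -= jobs.front().first;
            jobs.front().second();
            jobs.pop_front();
            ++ran;
        }
        return ran;
    }

    std::size_t pending() const { return jobs.size(); }
};
```

The trade-off mentioned above still applies: if jobs arrive faster than the budget drains them, the queue never empties, so some priority or coalescing scheme is needed on top.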


Will give the threading idea a go as well just to see; it should be simple enough for this chunk vertex buffer issue since building each buffer only reads the world data, never changes it, and only writes to its own buffer, no one else's. But going to see if the AMD tool can tell me whether I am memory bandwidth limited during that operation first, since if I am, the only thing I can think of to really do is improve CPU cache usage to make that code more efficient.
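Since each build only reads shared world data and writes its own output, the threaded version really is embarrassingly parallel. A minimal sketch with illustrative types (not the actual ChunkRenderer code, and with a stand-in computation in place of the real cache build):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative parallel chunk build: the shared world data is read-only
// and each thread writes only its own output slot, so no locking is
// needed for the builds themselves.
std::vector<int> buildAll(const std::vector<int> &world)
{
    std::vector<int> outputs(world.size());
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < world.size(); ++i)
    {
        threads.emplace_back([&world, &outputs, i]
        {
            outputs[i] = world[i] * 2;   // stand-in for the real cache build
        });
    }
    for (auto &t : threads)
        t.join();
    return outputs;
}
```

In practice one thread per chunk is wasteful; a fixed pool pulling chunk indices from a shared atomic counter would be the usual refinement, but the read-only/disjoint-write property is what makes either version safe.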


Also, while yes, I am sure there are further things I can do to use the GPU better and hit 1000fps, right now the only thing that ever causes noticeable lag spikes is various things on the logic thread, which at present is basically:

- Updating chunks (including some initial stuff like planting the first trees) / entities

- Calculating lighting (I am sure this is going to be a pain later, but for now I ignored chunks and pretty much everything else here and did a "canSeeSky()" lighting system)

- Creating those vertex buffers if something changed


Rendering, audio, and load/save (well, the compression and disk IO part) already have their own threads, so hopefully they don't impact this 20Hz logic problem.

In Topic: Multithreading a games logic

22 August 2013 - 03:07 AM

400FPS with around 370 chunks (renders a circle of chunks up to 384 units away) with a broken/disabled frustum cull (a separate issue I need to think about, since my code only actually tested the aabb corner points, and the aabb for a chunk is large enough that this gives false negatives). But yes, I think that hits the CPU limit there rather than the GPU limit (AMD Radeon HD 7870), but I suspect once I have some graphics that involve more than solid-pixel rendering with the simplest-ever shader and a few 16x16 textures, that gap will close.
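On the corner-point problem: the standard fix is to test the box against each frustum plane using the box corner furthest along the plane normal (the "p-vertex"), which never produces that kind of false negative (it is conservative the other way, occasionally drawing a box that could have been culled). A self-contained sketch with a simple plane struct, not whatever the engine actually uses:

```cpp
// A plane with inside half-space nx*x + ny*y + nz*z + d >= 0.
struct Plane { float nx, ny, nz, d; };

// Conservative AABB-vs-plane test: if even the corner furthest along the
// plane normal is outside, the whole box is outside. Unlike testing the
// 8 corners individually, this never culls a box that intersects the plane.
bool boxOutsidePlane(const Plane &p,
                     float minX, float minY, float minZ,
                     float maxX, float maxY, float maxZ)
{
    float px = p.nx >= 0 ? maxX : minX;
    float py = p.ny >= 0 ? maxY : minY;
    float pz = p.nz >= 0 ? maxZ : minZ;
    return p.nx * px + p.ny * py + p.nz * pz + p.d < 0;
}

// A box is culled only if it is fully outside at least one plane.
bool boxOutsideFrustum(const Plane *planes, int count,
                       float minX, float minY, float minZ,
                       float maxX, float maxY, float maxZ)
{
    for (int i = 0; i < count; ++i)
        if (boxOutsidePlane(planes[i], minX, minY, minZ, maxX, maxY, maxZ))
            return true;
    return false;
}
```

The per-box cost is six dot products at most, so even 370 chunk aabbs per frame is negligible next to the draw calls saved.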


For VBs, isn't that what the no-cpu/discard/no-overwrite/etc. vertex buffer access flags are for? Or even creating a new vertex buffer from scratch (still playing with the best way to handle that, since obviously I can't just make one with enough space for every vertex that might ever exist in the chunk). Not even sure how I am meant to know the GPU is truly finished, since Present doesn't mean the GPU has finished? Practically I guess the worst case is I might change a given vertex buffer once every few frames, and others nearly never. It's the constant buffers that seem to get changed all the time, to render models and such with different transforms.


I did consider VTune, but that would also require an Intel CPU, which I have been considering since before Ivy Bridge came out. I have been hoping their next gen might actually boost CPU performance to a point where I care enough to pay for it, but each generation I seem to get offered the same cost plus inflation, with a faster integrated GPU...


In Topic: Multithreading a games logic

21 August 2013 - 05:33 PM

Well, I guessed it was the display driver, but I still don't see why it shows up in the xperf stack trace there; why not inside an ID3D11DeviceContext->XYZ call? It's as if it has its own thread, blocks mine, then runs its own for a bit? I guess at 400FPS it could just be back buffer management stuff in practice; like I mentioned in some other thread, I doubt MS or AMD/Nvidia/Intel optimised all the things that should be once per frame anyway. If I cap it to 60FPS, both bits of code go down to something like 3% in xperf.


What do you mean by double buffered vertex buffers though? I just have some ID3D11Buffer objects which I put data into as part of the CreateBuffer call, or occasionally Map/Unmap (mainly constant buffers containing a matrix for some object). I never read from an ID3D11Buffer, although I guess it would be possible if I specified the CPU read flag.


As for the kernel calls, will take a look. A 50-second xperf capture only had a few samples of them; not sure exactly how those interrupts work and how xperf would interact with them. Actually, on inspection it almost looks like ETW/xperf detecting itself:

   |     |     |     |     |     |     |- ntkrnlmp.exe!KiInterruptDispatchNoLock (2 samples in 50 seconds)
   |     |     |     |     |     |     |- ntkrnlmp.exe!KiDpcInterrupt (1 sample in 50 seconds)
   |     |     |     |     |     |     |     ntkrnlmp.exe!KxDispatchInterrupt
   |     |     |     |     |     |     |     ntkrnlmp.exe!SwapContext_PatchXRstor
   |     |     |     |     |     |     |     ntkrnlmp.exe!EtwTraceContextSwap

The Task Manager kernel line seems to fluctuate between 1/8 and 1/6 of the load (approx 30% total, so approx 3.75% to 5% kernel), although it's hard to know how much of that is my doing. Without my app running, everything else accounts for about 1 to 5%.



Was just using "xperf -on latency -stackwalk profile" for the profiles. I recall there are a bunch of other options but can't really remember them; I've just been using that line from a .bat I made like a year ago. Might go look through the list again.