Well, I managed to get an implementation of that solid buffer idea working (not currently bit packed, just a bool array). I will take a more detailed look in the morning to see if I can optimise it, but I got some info from running the profiler during some normal gameplay, so it was not loaded too heavily with newly loaded chunks or changes to existing ones.
The problem of logic steps that recreate buffers is still there, with around 20 ms per re-cached chunk, which can push some update steps into the 60 or 70 ms range.
|- ChunkRenderer::buildCache 3.24
| |- SimpleBlockRenderer::cache 2.57
//The solid checks appear to have been fully inlined, but still take most of the time
| | |- SimpleBlockRenderer::cache<itself> 2.05
| | |- SimpleBlockRenderer::cacheTop 0.16
| | |- SimpleBlockRenderer::cacheNorth 0.08
| | |- SimpleBlockRenderer::cacheEast 0.07
| | |- SimpleBlockRenderer::cacheWest 0.07
| | |- SimpleBlockRenderer::cacheBottom 0.07
| | |- SimpleBlockRenderer::cacheSouth 0.06
| | |- ntkrnlmp.exe!KiInterruptDispatchNoLock 0.00
| | |- ntkrnlmp.exe!KiDpcInterrupt 0.00
//Looping over the chunk's blocks, calling the correct renderer for the block type
//(only SimpleBlockRenderer exists currently)
| |- ChunkRenderer::buildCache<itself> 0.27
//Largely got inlined. This function actually fills out a layer of the buffer using the world data
| |- SolidBlockBuffer::fillLayer 0.22
//This basically takes the vertex/index buffer for each texture and makes a single pair of buffers
//with a list of start/count integers for each texture, ready for the render thread to take
| |- ChunkStaticRenderer::buildCache 0.08
void SimpleBlockRenderer::cache(
WorldChunk &chunk,
ChunkRenderer &chunkRenderer,
const SolidBlockBuffer &solidBlockCache,
const Block *block,
BlockId bid,
//TODO: x,z could be coordinates within the chunk, rather than the world
int x,
int y,
int z)
{
if (!solidBlockCache.isSolid(chunk, x, y + 1, z))
cacheTop(chunk, chunkRenderer, block, bid, x, y, z);
if (!solidBlockCache.isSolid(chunk, x, y - 1, z))
cacheBottom(chunk, chunkRenderer, block, bid, x, y, z);
if (!solidBlockCache.isSolid(chunk, x + 1, y, z))
cacheEast(chunk, chunkRenderer, block, bid, x, y, z);
if (!solidBlockCache.isSolid(chunk, x - 1, y, z))
cacheWest(chunk, chunkRenderer, block, bid, x, y, z);
if (!solidBlockCache.isSolid(chunk, x, y, z + 1))
cacheNorth(chunk, chunkRenderer, block, bid, x, y, z);
if (!solidBlockCache.isSolid(chunk, x, y, z - 1))
cacheSouth(chunk, chunkRenderer, block, bid, x, y, z);
return;
}
bool SolidBlockBuffer::isSolid(const WorldChunk &chunk, int x, int y, int z)const
{
//TODO: This messing around to get bufferIndex can be avoided
//e.g. could just keep the bufferIndex for y - 1, y and y + 1 rather
//than just y, and could have 3 variations of isSolid rather than take
//the y param.
//TODO: Can also avoid this -= on x, z for every check
x -= chunk.getBlockX();
z -= chunk.getBlockZ();
assert(x >= -1 && x <= World::CHUNK_SIZE);
assert(z >= -1 && z <= World::CHUNK_SIZE);
assert(y >= currentY - 1 && y <= currentY + 1);
int d = y - currentY;
assert(d >= -1 && d <= 1);
if (d == -1) d += 3;
int bufferIndex = (currentBuffer + d) % 3;
x += 1;
z += 1;
return buffer[bufferIndex][x * SIZE + z];
}
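A rough sketch of what the TODOs in isSolid could look like: resolve the three layer slots once per y layer and take chunk-local coordinates, so each check is just an index. RollingSolidBuffer, setCurrent, solidBelow/solidHere/solidAbove and the CHUNK_SIZE value are made-up names for illustration, not the real classes, and this is untested against the actual renderer.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for World::CHUNK_SIZE.
constexpr int CHUNK_SIZE = 16;
// One block of border on each side so neighbour lookups stay in range.
constexpr int SIZE = CHUNK_SIZE + 2;

class RollingSolidBuffer {
public:
    // Maps chunk-local x/z in [-1, CHUNK_SIZE] to an index within a layer.
    static int index(int lx, int lz) {
        assert(lx >= -1 && lx <= CHUNK_SIZE);
        assert(lz >= -1 && lz <= CHUNK_SIZE);
        return (lx + 1) * SIZE + (lz + 1);
    }

    // Used by whatever fills a layer from the world data; slot is 0..2.
    void set(int slot, int lx, int lz, bool solid) {
        assert(slot >= 0 && slot < 3);
        layers[slot][index(lx, lz)] = solid;
    }

    // Call once per y layer. Does the "% 3" work up front so the
    // three lookups below need no modulo and no branch on y.
    void setCurrent(int currentSlot) {
        assert(currentSlot >= 0 && currentSlot < 3);
        belowSlot = (currentSlot + 2) % 3;  // same as (current - 1) mod 3
        hereSlot = currentSlot;
        aboveSlot = (currentSlot + 1) % 3;
    }

    bool solidBelow(int lx, int lz) const { return layers[belowSlot][index(lx, lz)]; }
    bool solidHere(int lx, int lz) const { return layers[hereSlot][index(lx, lz)]; }
    bool solidAbove(int lx, int lz) const { return layers[aboveSlot][index(lx, lz)]; }

private:
    std::array<std::array<bool, SIZE * SIZE>, 3> layers{};  // all false initially
    int belowSlot = 2;
    int hereSlot = 0;
    int aboveSlot = 1;
};
```

With something like this, cache() would call solidAbove(lx, lz) for the top face, solidHere(lx + 1, lz) for the east face and so on, with lx/lz computed once per block instead of once per check.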
I am going to take a look at the AMD tool as well. It would be really nice to see what's going on within inlined functions, and manually taking the xperf instruction addresses and comparing them to the VS disassembly panel is a pain, but I have only ever seen anything about that with VTune and an ICC-compiled module.
EDIT:
Another idea, I suppose, is to spread the work out, since not every update step modifies multiple vertex buffers. But that can result either in a never-ending task to get to an "everything is done" state if enough changes are happening, or in errors where adjacent chunks don't line up fully because one was not updated. I could also reduce the chunk size at the cost of making some other things less efficient: more vertex/index buffers doing smaller draws, more expensive cross-chunk block accesses, and more difficulties with large map features that are not purely noise based (e.g. if a tree or building could span its own chunk, an adjacent chunk, and one beyond that). I guess it might be worth it, though.
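To make the "spread the work out" idea concrete, here is a minimal sketch of a time-budgeted rebuild queue. RebuildQueue, ChunkId, markDirty and process are all hypothetical names, and it does nothing about the adjacent-chunk mismatch problem; it only caps how much rebuilding a single update step does.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <functional>

// Hypothetical chunk handle; the real code would use whatever identifies a chunk.
using ChunkId = int;

class RebuildQueue {
public:
    // Note: does not de-duplicate, so marking a chunk twice rebuilds it twice.
    void markDirty(ChunkId id) { pending.push_back(id); }

    std::size_t pendingCount() const { return pending.size(); }

    // Rebuilds queued chunks in FIFO order until the time budget is used up.
    // At least one chunk is rebuilt per call (if any are queued), so the
    // queue always makes progress even with a tiny budget.
    // Returns how many chunks were rebuilt.
    std::size_t process(std::chrono::microseconds budget,
                        const std::function<void(ChunkId)> &rebuild) {
        const auto start = std::chrono::steady_clock::now();
        std::size_t done = 0;
        while (!pending.empty()) {
            const ChunkId id = pending.front();
            pending.pop_front();
            rebuild(id);
            ++done;
            if (std::chrono::steady_clock::now() - start >= budget)
                break;
        }
        return done;
    }

private:
    std::deque<ChunkId> pending;
};
```

At 20 Hz each logic step has 50 ms in total, so the budget passed to process() would be some fraction of that. If chunks get dirtied faster than the budget allows, pendingCount() just keeps growing, which is the never-ending case described above.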
I will give the threading idea a go as well, just to see. It should be simple enough for this chunk vertex buffer issue, since building each buffer only reads the world data, never changes it, and only writes to its own buffer, no one else's. But I am going to see if the AMD tool can tell me whether I am memory bandwidth limited during that operation first, since if I am, the only thing I can think of to really do is improve CPU cache usage to make that code more efficient.
Also, while yes, I am sure there are further things I can do to use the GPU better and hit 1000fps, right now the only thing that ever causes noticeable lag spikes is the work on the logic thread, which at present is basically:
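For the threading idea, a minimal sketch of the shape it might take with std::async: every task gets a const reference to the world and returns its own mesh, so nothing is shared for writing. World, ChunkMesh, buildChunkMesh and buildMeshesParallel are placeholders invented for this example, and it assumes nothing else modifies the world while the tasks run.

```cpp
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Placeholder types; the real world and vertex data are far richer.
struct World {
    std::vector<int> blocks;
};

struct ChunkMesh {
    std::vector<float> vertices;
};

// Stand-in for the real per-chunk cache build: reads the world, writes
// only to the mesh it returns.
ChunkMesh buildChunkMesh(const World &world, std::size_t chunkIndex) {
    ChunkMesh mesh;
    mesh.vertices.push_back(static_cast<float>(world.blocks[chunkIndex]));
    return mesh;
}

// Builds the meshes for all dirty chunks concurrently and returns them
// in the same order as the dirty list.
std::vector<ChunkMesh> buildMeshesParallel(const World &world,
                                           const std::vector<std::size_t> &dirty) {
    std::vector<std::future<ChunkMesh>> tasks;
    tasks.reserve(dirty.size());
    for (const std::size_t idx : dirty) {
        // std::launch::async forces a separate thread per task; a thread
        // pool would probably be the better fit for many small chunks.
        tasks.push_back(std::async(std::launch::async,
                                   [&world, idx] { return buildChunkMesh(world, idx); }));
    }

    std::vector<ChunkMesh> meshes;
    meshes.reserve(tasks.size());
    for (auto &task : tasks)
        meshes.push_back(task.get());  // blocks until that chunk is done
    return meshes;
}
```

If the build turns out to be memory bandwidth limited, this would not be expected to scale well with thread count, which is why checking that first makes sense.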
- Updating chunks (including some initial stuff like planting the first trees) / entities
- Calculating lighting (I am sure this is going to be a pain later, but for now I ignored chunks and pretty much everything else here and did a "canSeeSky()" lighting system)
- Creating those vertex buffers if something changed
Rendering, audio, and load/save (well, the compression and disk IO part) already have their own threads, so hopefully they don't impact this 20Hz logic problem.