# Performance tips

Hello everybody

I'm working on an engine for a game I'm planning to write. I plan to make it an RPG, and if it goes well, I might evolve it into an MMORPG where the player is part of an army.
I've done a lot of work already, but performance is killing me. Since it's an army, the battlefield needs a massive number of soldiers, plus terrain and some trees and rocks. Terrain isn't much of a problem: LOD will take care of it, since not much of it is visible at once anyway. The main problem is the soldiers. A low-poly soldier seems to need at least 500 vertices to look half-decent, which I could live with, but that appears to be too much to render. I tried rendering a thousand visible soldiers with a basic flat-color effect, and the FPS was 4.

So that's the problem. I can reduce the number of visible soldiers by adding a lot of fog and dust (caused by the chaos of war), which adds some realism too, but I will still need to render several hundred soldiers. What can I do to improve performance? Would multithreading be of any use here, and how?

Please feel free to share thoughts, ideas, and tips.

##### Share on other sites
Hello

Do you use (or plan to use) geometry instancing for rendering your soldiers?
I would think it is necessary here, because you plan to render the same geometry many times.

##### Share on other sites
If it helps and I can use it, why not! But I read somewhere that the maximum number of vertices for instancing is 32. I don't know if that's true, and I really hope it's not; perhaps I misread it? What are the restrictions of instancing? Does every instance have its own animation frame?
I haven't implemented an animation system yet (it's next on my list), so I'm not sure about it.

Thanks for the suggestion. I hope instancing large objects (large in vertex count) is possible.

##### Share on other sites
There is no such limitation as far as I know (I'm using it with 100 or more instances, at least).
(Maybe this is more of a cache problem: if the geometry can't be held in the cache as a whole, there will be cache swapping, which could lead to a performance drop.)
You can handle animations (skinned meshes) with instancing.
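
As far as the API goes, `DrawIndexedInstanced` simply takes an index count per instance and an instance count. Per-instance values, such as an animation frame, usually come from a second vertex buffer that is stepped once per instance (`D3D11_INPUT_PER_INSTANCE_DATA`). A minimal CPU-side sketch of filling such a buffer follows; `SoldierInstance`, `BuildInstances`, the grid layout, and the frame staggering are hypothetical choices for illustration only:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-instance record. In D3D11 this would live in a second
// vertex buffer whose input elements use D3D11_INPUT_PER_INSTANCE_DATA.
struct SoldierInstance {
    float    position[3];     // world-space position of this soldier
    float    yaw;             // facing angle in radians
    uint32_t animationFrame;  // each instance carries its own frame
};

// Fill one record per soldier, laid out on a grid, with staggered
// animation frames so neighbours are not in lockstep.
std::vector<SoldierInstance> BuildInstances(uint32_t count,
                                            uint32_t columns,
                                            float spacing,
                                            uint32_t framesInClip)
{
    assert(columns > 0 && framesInClip > 0);
    std::vector<SoldierInstance> instances;
    instances.reserve(count);
    for (uint32_t i = 0; i < count; ++i) {
        SoldierInstance s{};
        s.position[0] = static_cast<float>(i % columns) * spacing;
        s.position[1] = 0.0f;
        s.position[2] = static_cast<float>(i / columns) * spacing;
        s.yaw = 0.0f;
        s.animationFrame = (i * 7u) % framesInClip;  // arbitrary stagger
        instances.push_back(s);
    }
    return instances;
}
```

The vertex shader would then read the frame from the instance stream and pick the matching bone data, which is roughly what skinned-instancing samples do.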

##### Share on other sites
Thanks for the reply. I will implement instancing now and see how it goes. As for multithreading, does anyone know if it could benefit me? Multithreaded rendering is a feature of Direct3D 11, but does it improve performance? I have seen users reporting worse performance with it. Is that true?

##### Share on other sites
What version of D3D are you using?

Assuming D3D9, how many calls to DrawPrimitive or DrawIndexedPrimitive are you making? You need to reduce the number of draw calls as much as possible if you want to get good performance - which is what instancing helps with.

##### Share on other sites
I'm using D3D11, sorry for not mentioning this.

##### Share on other sites
I've got a question regarding instancing: how are instances rendered without an index buffer? How are triangles connected to each other?

EDIT: I figured out that DrawIndexedInstanced exists. It's pretty nice, and the FPS has increased by a great amount, though still not enough. What else can I do? I'll ask once again: can multithreading help me, or is it a waste of time?

##### Share on other sites
- Reduce all unnecessary state changes.
- Use instancing as much as you can. There's a sample in the NVIDIA DirectX 10 SDK called "Skinned Instancing"; take a look at it. You should be able to draw all soldiers with a single draw call.
- Profile: get NVIDIA PerfHUD or AMD PerfStudio and find out whether you're CPU bound or GPU bound. If you're GPU bound, check what is doing most of the work, the vertex shader or the pixel shader, and also whether you're ALU bound or texture-fetch bound.
- Try to get a decent frame rate without using multithreading...

What technique are you using to render the terrain? Do you use any terrain LOD technique? Again, PerfHUD can show you whether you're spending too much time in terrain rendering or in something else...

Use the data you get from profiling programs to see what you have to improve.

I'm guessing you are doing too much work in the vertex shader when rendering the soldiers and the terrain. Have you considered using billboards to draw distant soldiers? (4 vertices per soldier = WIN)

EDIT: Some time ago I read a paper about rendering thousands of zombies onscreen... (It was from Valve or something, I can't find it now but maybe someone knows and I think it might be useful to you)

##### Share on other sites
I will look into PerfHUD now.

Actually, I'm running a simple test as a proof of concept. So far the test only renders a quad of 2 triangles and a number of soldiers, and all soldiers are rendered in 1 draw call. I'm not using my terrain for now, as it has a very poor LOD system and is not very efficient. I might look into tessellation for a solution, because LOD is already causing me a lot of problems.

As for state changes, I don't think there are a lot of them, but do they really consume much time?

Also, I'm using somewhat heavy shaders. Running with flat-color shaders gives me 300-400 FPS at 640x480 windowed, but that is not quite satisfying, as I will need texturing, and I really want per-pixel lighting for some planned effects, which will probably drop me below 60 FPS even at 640x480. I will try to strip it down, and might even end up with vertex lighting if it's faster.

The soldiers will not all be of the same type, though. There will be at least 4 types of soldiers, plus up to 4 unique characters, along with rocks and trees. I will need more draw calls, and thus get fewer FPS. So I thought maybe I could make 2 threads, each rendering a part, and present the swap chain when both finish. Wouldn't that speed things up?
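
With instancing, the draw-call count tends to follow the number of distinct meshes rather than the number of objects. A rough sketch of grouping visible objects by mesh before drawing, assuming a hypothetical integer `MeshId` per object (`BuildBatches` is an illustrative name as well):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical mesh identifier: for example, 0..3 for the soldier types
// and 4..7 for the unique characters.
using MeshId = int;

// Group objects by mesh so each group can be submitted as one instanced
// draw. Returns mesh id -> number of instances of that mesh.
std::map<MeshId, std::size_t> BuildBatches(const std::vector<MeshId>& visibleObjects)
{
    std::map<MeshId, std::size_t> batches;
    for (MeshId id : visibleObjects)
        ++batches[id];  // operator[] value-initializes the count to 0
    return batches;
}
```

For 1000 soldiers spread over 4 types plus 4 unique characters, this gives 8 batches, so 8 instanced draw calls rather than 1004 individual ones.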

EDIT: @ your edit: billboards are great and I will look into them, but aren't they for static objects? If each soldier has his own animation, wouldn't I need to render each soldier to a texture and then render it on a quad? That seems slower than the usual approach. I will (eh, must...) use billboards for trees and rocks, though.

##### Share on other sites
Don't bother with PerfHUD, it doesn't work with D3D11. Get Parallel Nsight instead, it's their replacement for PerfHUD.

##### Share on other sites
About multithreading in D3D11: it all depends on how things are done in your engine. If you do enough CPU work for each soldier that you are processing, then it is possible that multithreading could help. By CPU work, I mean both state changes and draw calls, plus the matrix multiplication or whatever else you do to set up each rendering operation.

One of the chapter samples from our book (linked below) handles a situation similar to (but still different from) yours: many simple renderings without too much GPU work done in each one. In my testing, it uses a few reflective objects that require environment maps of the whole scene to be rendered. For each environment map needed, the whole scene has to be rendered again. I then scale the number of objects from 200 up to 1000 and test single-threaded vs. multithreaded modes. The multithreaded mode is up to 60% faster on a quad-core machine than the single-threaded mode. Here's an image of the scenario:

[attachment=4829:MirrorMirror.png]

You can take a look at the code from the Hieroglyph 3 repository in the MirrorMirror demo, although it probably won't be directly applicable to your engine... anyways, it might be useful to check out.

##### Share on other sites

> Also, I'm using somewhat heavy shaders. Running with flat-color shaders gives me 300-400 FPS at 640x480 windowed, but that is not quite satisfying, as I will need texturing, and I really want per-pixel lighting for some planned effects, which will probably drop me below 60 FPS even at 640x480. I will try to strip it down, and might even end up with vertex lighting if it's faster.

You can perform per-pixel lighting with nice effects on nearby soldiers and render distant soldiers using per-vertex lighting (think of it as an effects LOD system).
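
A sketch of how such an effects LOD could be selected per soldier, with billboards as the farthest tier. The tier names, the `DetailThresholds` struct, and the distances a caller would pass in are all hypothetical tuning choices:

```cpp
#include <cassert>

enum class DetailTier { PerPixelLit, PerVertexLit, Billboard };

// Hypothetical tuning values, in world units.
struct DetailThresholds {
    float perPixelMaxDistance;   // closer than this: full per-pixel lighting
    float perVertexMaxDistance;  // closer than this: cheaper per-vertex lighting
};

// Pick the cheapest tier that still looks acceptable at this distance.
DetailTier SelectDetailTier(float distanceToCamera, const DetailThresholds& t)
{
    assert(t.perPixelMaxDistance <= t.perVertexMaxDistance);
    if (distanceToCamera < t.perPixelMaxDistance)  return DetailTier::PerPixelLit;
    if (distanceToCamera < t.perVertexMaxDistance) return DetailTier::PerVertexLit;
    return DetailTier::Billboard;
}
```

Soldiers would be sorted into one instance list per tier each frame, so each tier can still go out as a single instanced draw with its own shader.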

> Don't bother with PerfHUD, it doesn't work with D3D11. Get Parallel Nsight instead, it's their replacement for PerfHUD.

Oh, looks like I missed a post

##### Share on other sites
@TiagoCosta: Good idea actually, I will certainly implement that.

@MJP: Alright, I already have Parallel Nsight, though I never thought of using it for analysis. I thought it was just a shader debugging tool (I never got shader debugging working; mods say I need 2 machines... whatever).
It looks like I'm calling Map/Unmap A LOT in my own effect system implementation (I didn't like the framework provided by Microsoft; it doesn't look official). I think I can get a good FPS increase after optimizing it.

@Jason Z: Thanks for sharing that. I took a quick look through your MirrorMirror demo, but I can't seem to find the multithreading code. Is it deep in the system? Would you mind pointing it out?

@everyone: It also looks like I've got a lot of math calculations going on, and I don't think I can reduce them. Would a compute shader help me in this situation? I have never tried one, but is it worth trying?

##### Share on other sites

> @MJP: Alright, I already have Parallel Nsight, though I never thought of using it for analysis. I thought it was just a shader debugging tool (I never got shader debugging working; mods say I need 2 machines... whatever).

You only need 2 machines or 2 GPUs for debugging shaders. API debugging and performance analysis can be done on the host machine.

> It looks like I'm calling Map/Unmap A LOT in my own effect system implementation (I didn't like the framework provided by Microsoft; it doesn't look official). I think I can get a good FPS increase after optimizing it.

Yes, you'll definitely want to minimize how much you Map your buffers. For constant buffers in particular, you should try to organize them so that you can minimize the number of times you need to update them in a frame. Typically this is done by separating constants into separate buffers based on frequency of update (such as constants updated once per frame, once per render pass, once per object, etc.).
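
Grouping constants by update frequency might look like the sketch below. The struct names and their fields are hypothetical; the one hard rule relied on is that a D3D11 constant buffer's size must be a multiple of 16 bytes. The two helper functions only estimate how many bytes get uploaded per frame under each layout:

```cpp
#include <cstddef>

// Hypothetical layouts, one struct per update frequency.
struct alignas(16) PerFrameConstants {
    float viewProjection[16];
    float cameraPosition[4];
    float fogColor[4];
};
struct alignas(16) PerPassConstants {
    float lightDirection[4];
    float lightColor[4];
};
struct alignas(16) PerObjectConstants {
    float world[16];
};

static_assert(sizeof(PerFrameConstants)  % 16 == 0, "size must be a multiple of 16");
static_assert(sizeof(PerPassConstants)   % 16 == 0, "size must be a multiple of 16");
static_assert(sizeof(PerObjectConstants) % 16 == 0, "size must be a multiple of 16");

// Split layout: each buffer is rewritten only as often as its data changes.
constexpr std::size_t BytesUploadedSplit(std::size_t passes, std::size_t objectsPerPass)
{
    return sizeof(PerFrameConstants)
         + passes * sizeof(PerPassConstants)
         + passes * objectsPerPass * sizeof(PerObjectConstants);
}

// Single big buffer: everything is rewritten for every object in every pass.
constexpr std::size_t BytesUploadedMonolithic(std::size_t passes, std::size_t objectsPerPass)
{
    return passes * objectsPerPass
         * (sizeof(PerFrameConstants) + sizeof(PerPassConstants) + sizeof(PerObjectConstants));
}
```

With 2 passes and 100 objects per pass, the split layout uploads roughly a third of the bytes the single-buffer layout does; the exact saving depends on what actually lives in each struct.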

##### Share on other sites
> Yes, you'll definitely want to minimize how much you Map your buffers. For constant buffers in particular, you should try to organize them so that you can minimize the number of times you need to update them in a frame. Typically this is done by separating constants into separate buffers based on frequency of update (such as constants updated once per frame, once per render pass, once per object, etc.).

In fact, I have this system implemented, but there appears to be a hole in it. I'm working on fixing it.

I've also got one more question: is tessellation the right way to handle terrain LOD, if it's even possible?

##### Share on other sites
> @Jason Z: Well thanks for sharing that, i took a quick look through your mirrormirror demo, i can't seem to find multithreading code, is it deep in the system? mind if you point it out?

It is integrated at the heart of the RendererDX11 class, with the multithreading being handled based on the application calling RendererDX11::SetMultiThreadingState().

> Yes, you'll definitely want to minimize how much you Map your buffers. For constant buffers in particular, you should try to organize them so that you can minimize the number of times you need to update them in a frame. Typically this is done by separating constants into separate buffers based on frequency of update (such as constants updated once per frame, once per render pass, once per object, etc.).

Multithreading can also help with mapping buffers, since the work is done on multiple threads. There are a few limitations though - the buffers can't be read from, and depending on how many you are mapping in a single command list, there can be quite a bit of extra memory being carried around by the driver until you execute the command list. Even so, I think it is worth trying out.

About the terrain rendering, I think it all depends on what you are doing and how it has to work... Tessellation isn't free, but if you use it to keep a massive number of vertices from being processed every frame, then it is a net win. Can you give more specifics about your desired system?

##### Share on other sites
> About the terrain rendering, I think it all depends on what you are doing and how it has to work... Tessellation isn't free, but if you use it to reduce a massive amount of vertices from being processed every frame, then it is a net win. Can you specify more specifics about your desired system?

Well, it's a basic terrain system, nothing special yet, but the terrains should be really huge. I already have a frustum culling system and some fog, and it looks good, but around 10-30 sectors will always be drawn. Each one consists of 67x67 = 4489 vertices, so for an average of 20 sectors that is 4489 * 20 = 89780 vertices. I think that's worth reducing?

My current LOD system is very poor. It is merely a couple of index buffers, one for each level; vertex positions don't change, only the triangle count. It gives me reasonably good performance, but the problem is that there are gaps between sectors at different LODs. For example, one sector has 33 vertices on a side and the one next to it has 17, so they simply can't join without gaps. I was thinking about keeping the outer vertices active while reducing the inner triangles, but I still can't find a good way to do it.
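
The gap comes from the shared edge: at full detail the edge has 33 vertices, at the next level only 17, and the 16 skipped vertices generally do not sit on the straight lines the coarser sector draws between their neighbours. A small sketch of per-level index generation over one shared vertex grid, assuming a square sector with `vertsPerSide` vertices on a side and a hypothetical `step` parameter (1 = full detail, 2 = every other vertex):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Build a triangle-list index buffer over a vertsPerSide x vertsPerSide
// vertex grid, touching only every `step`-th vertex in each direction.
// (vertsPerSide - 1) must be divisible by step.
std::vector<uint32_t> BuildLodIndices(uint32_t vertsPerSide, uint32_t step)
{
    assert(step > 0 && vertsPerSide > 1 && (vertsPerSide - 1) % step == 0);
    std::vector<uint32_t> indices;
    for (uint32_t z = 0; z + step < vertsPerSide; z += step) {
        for (uint32_t x = 0; x + step < vertsPerSide; x += step) {
            const uint32_t i0 = z * vertsPerSide + x;           // corner (x, z)
            const uint32_t i1 = i0 + step;                      // corner (x + step, z)
            const uint32_t i2 = (z + step) * vertsPerSide + x;  // corner (x, z + step)
            const uint32_t i3 = i2 + step;                      // corner (x + step, z + step)
            indices.insert(indices.end(), { i0, i2, i1, i1, i2, i3 });
        }
    }
    return indices;
}

// Number of vertices along one edge that a given step actually references.
uint32_t EdgeVertexCount(uint32_t vertsPerSide, uint32_t step)
{
    return (vertsPerSide - 1) / step + 1;
}
```

`EdgeVertexCount(33, 1)` and `EdgeVertexCount(33, 2)` give the 33 and 17 from the post. Keeping the outer ring at full detail, as suggested above, would amount to generating the border quads with `step` 1 and stitching them to the coarser interior.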

Also, I came across an idea yesterday. I've heard that vertex shaders can sample textures, so maybe I could render all terrain sectors at once using instancing: I just give the heightmap to the shader, and it offsets each vertex by the value at its u,v coordinates in the map. I'm going to try it soon.

EDIT: Corrected the vertex counts, sorry.

##### Share on other sites
Sampling textures in the vertex shader turned out not to be a good idea: it didn't completely work, and there was no performance gain anyway. With or without LOD, though, there is too much work for the CPU and a lot of draw calls. I will try rendering the terrain sectors on multiple threads to see how that works.

@Jason Z: It looks like you're implementing it the same way I was planning to. I will give it a try.

##### Share on other sites
I'm having a little problem: I get an "access violation reading location xxxx" when I call ExecuteCommandList on the immediate context. Here is my current code:

    HANDLE handles[HX_TERRAIN_RENDERING_THREADS_COUNT];
    RenderSectors* args[HX_TERRAIN_RENDERING_THREADS_COUNT];
    UINT sectorCount[HX_TERRAIN_RENDERING_THREADS_COUNT] = { 0 };
    for ( UINT i = 0; i < SectorsToRender.size ( ); i++ )
        sectorCount[i%HX_TERRAIN_RENDERING_THREADS_COUNT]++;
    for ( UINT i = 0; i < HX_TERRAIN_RENDERING_THREADS_COUNT; i++ )
    {
        args[i] = new RenderSectors ( );
        args[i]->context = _core->GetD3D11DeferredContext ( i );
        args[i]->numSectors = sectorCount[i];
        args[i]->sectors = i == 0 ? &SectorsToRender[0] : &SectorsToRender[sectorCount[i-1]];
        args[i]->indexBuffers = i == 0 ? &indexBuffersToRender[0] : &indexBuffersToRender[sectorCount[i-1]];
        handles[i] = (HANDLE)_beginthread ( __RenderSectors, 0, (void*)args[i] );
    }
    WaitForMultipleObjects ( HX_TERRAIN_RENDERING_THREADS_COUNT, handles, TRUE, INFINITE );
    for ( UINT i = 0; i < HX_TERRAIN_RENDERING_THREADS_COUNT; i++ )
    {
        _core->GetD3D11DeviceContext()->ExecuteCommandList ( args[i]->commandList, FALSE ); //access violation reading location xxxx
        HX_SAFE_FREE ( args[i]->commandList );
        HX_SAFE_DELETE ( args[i] );
    }

Basically, I have 2 vectors containing the buffers required for rendering. I create a new RenderSectors object that holds pointers into those vectors, some variables, and a pointer to the deferred context to use, and I pass it to the thread so it can render them.
When all threads finish, I call ExecuteCommandList on the immediate context, and this is where the error occurs. _core->GetD3D11DeviceContext() returns a valid pointer to the immediate context, and args->commandList is also valid.

Also, here is my thread code:

    static void __RenderSectors ( void* sectors )
    {
        RenderSectors* renderSectors = (RenderSectors*)sectors;
        if ( !renderSectors )
            return;
        for ( UINT i = 0; i < renderSectors->numSectors; i++ )
        {
            //set the mesh data in the input assembler to draw the mesh
            UINT Strides[1];
            UINT Offsets[1];
            ID3D11Buffer* pVB[1];
            pVB[0] = renderSectors->sectors[i]->vb;
            Strides[0] = sizeof _hxTerrainVertex;
            Offsets[0] = 0;
            renderSectors->context->IASetVertexBuffers ( 0, 1, pVB, Strides, Offsets );
            renderSectors->context->IASetIndexBuffer ( renderSectors->indexBuffers[i]->buffer, DXGI_FORMAT_R32_UINT, 0 );
            //draw the mesh
            renderSectors->context->DrawIndexed ( renderSectors->indexBuffers[i]->size, 0, 0 );
        }
        renderSectors->context->FinishCommandList ( FALSE, &renderSectors->commandList );
        _endthread ( );
    };

Looks like I missed a point regarding multithreaded rendering?

##### Share on other sites
Are you calling release on your command lists once you are done with them? Do you reuse them from frame to frame, or are they recycled multiple times?

Also, have you tried running the same program on the reference device, just to make sure you have a sequencing problem?

##### Share on other sites
Command lists are recycled every frame. Can they be reused?

Anyway, it looks like I was setting states only on the immediate context. Now there are 2 problems:
- FPS is down from ~60 to 8
- Nothing is rendered, neither the terrain nor my normal objects (instances)

And then, sometimes I get an exception and the debug layer reports:
D3D11: CORRUPTION: ID3D11DeviceContext::ExecuteCommandList: First parameter is corrupt or NULL. [ MISCELLANEOUS CORRUPTION #13: CORRUPTED_PARAMETER1 ]
Perhaps a synchronization error? How can that be, when the thread doesn't end before creating a command list, and I'm calling WaitForMultipleObjects from the main thread?

EDIT: With some experiments, I'm also getting this sometimes:
D3D11: CORRUPTION: ID3D11DeviceContext::IASetInputLayout: Two threads were found to be executing functions associated with the same Device at the same time. This will cause corruption of memory. Appropriate thread synchronization needs to occur external to the Direct3D API. 3000 and 4432 are the implicated thread ids. [ MISCELLANEOUS CORRUPTION #28: CORRUPTED_MULTITHREADING ]
Does it mean that 2 threads are using the same context?

##### Share on other sites
Alright, it doesn't seem to be only a synchronization problem. Even if I start each thread and then wait for it with WaitForSingleObject, nothing is rendered. Note that doing so removes all the synchronization errors, which means there was also a synchronization problem.

Most recent code:

    //first, set the states of all contexts
    for ( UINT i = 0; i < HX_TERRAIN_RENDERING_THREADS_COUNT; i++ )
    {
        ID3D11DeviceContext* context = _core->GetD3D11DeferredContext ( i );
        //render the effect (update its values)
        _effect->Render ( context );
        context->PSSetSamplers ( 0, 1, &_sampleState );
        FLOAT blendFactors[] = { 0.0f, 0.0f, 0.0f, 0.0f };
        context->IASetPrimitiveTopology ( D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST );
        context->RSSetState ( _rasterizerState );
        context->OMSetDepthStencilState ( _depthStencilState, 0 );
        context->OMSetBlendState ( _blendState, blendFactors, 0xffffffff );
        context->PSSetSamplers ( 0, 1, &_sampleState );
    }
    HANDLE handles[HX_TERRAIN_RENDERING_THREADS_COUNT];
    RenderSectors* args[HX_TERRAIN_RENDERING_THREADS_COUNT];
    UINT sectorCount[HX_TERRAIN_RENDERING_THREADS_COUNT] = { 0 };
    for ( UINT i = 0; i < SectorsToRender.size ( ); i++ )
        sectorCount[i%HX_TERRAIN_RENDERING_THREADS_COUNT]++;
    UINT address = 0;
    for ( UINT i = 0; i < HX_TERRAIN_RENDERING_THREADS_COUNT; i++ )
    {
        if ( sectorCount[i] )
        {
            args[i] = new RenderSectors ( );
            args[i]->context = _core->GetD3D11DeferredContext ( i );
            args[i]->numSectors = sectorCount[i];
            args[i]->sectors = &SectorsToRender[address];
            args[i]->indexBuffers = &indexBuffersToRender[address];
            address += sectorCount[i];
            handles[i] = (HANDLE)_beginthread ( __RenderSectors, 0, (void*)args[i] );
        }
        else
            handles[i] = NULL;
    }
    WaitForMultipleObjects ( HX_TERRAIN_RENDERING_THREADS_COUNT, handles, TRUE, INFINITE );
    for ( UINT i = 0; i < HX_TERRAIN_RENDERING_THREADS_COUNT; i++ )
    {
        if ( args[i] )
        {
            if ( args[i]->numSectors )
            {
                _core->GetD3D11DeviceContext()->ExecuteCommandList ( args[i]->commandList, FALSE );
                HX_SAFE_FREE ( args[i]->commandList );
                HX_SAFE_DELETE ( args[i] );
            }
        }
    }

Shouldn't this make sure that the code after WaitForMultipleObjects will NEVER execute before all my threads are finished? Here is the thread code:

    static void __RenderSectors ( void* sectors )
    {
        RenderSectors* renderSectors = (RenderSectors*)sectors;
        if ( !renderSectors )
            return;
        for ( UINT i = 0; i < renderSectors->numSectors; i++ )
        {
            //set the mesh data in the input assembler to draw the mesh
            UINT Strides[1];
            UINT Offsets[1];
            ID3D11Buffer* pVB[1];
            pVB[0] = renderSectors->sectors[i]->vb;
            Strides[0] = sizeof _hxTerrainVertex;
            Offsets[0] = 0;
            renderSectors->context->IASetVertexBuffers ( 0, 1, pVB, Strides, Offsets );
            renderSectors->context->IASetIndexBuffer ( renderSectors->indexBuffers[i]->buffer, DXGI_FORMAT_R32_UINT, 0 );
            //draw the mesh
            renderSectors->context->DrawIndexed ( renderSectors->indexBuffers[i]->size, 0, 0 );
        }
        renderSectors->context->FinishCommandList ( FALSE, &renderSectors->commandList );
        _endthread ( );
    };

Note that at the end it calls FinishCommandList on the provided context, which is a deferred context that is not being used by any other thread, so I don't really see why I'm getting the 2 errors mentioned in my previous post.
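
The running `address` offset in the newer code is what makes each thread's slice start in the right place. The same idea in isolation, as a plain C++ sketch with no D3D11 calls involved (`Range` and `SplitIntoRanges` are hypothetical names):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A contiguous slice of the work list handed to one worker.
struct Range {
    std::size_t first;  // index of the first item in the slice
    std::size_t count;  // number of items in the slice (may be 0)
};

// Split itemCount items into workerCount contiguous slices whose sizes
// differ by at most one, using a running offset for each slice's start.
std::vector<Range> SplitIntoRanges(std::size_t itemCount, std::size_t workerCount)
{
    assert(workerCount > 0);
    std::vector<Range> ranges(workerCount);
    const std::size_t base  = itemCount / workerCount;
    const std::size_t extra = itemCount % workerCount;  // first `extra` workers get one more
    std::size_t offset = 0;
    for (std::size_t w = 0; w < workerCount; ++w) {
        ranges[w].first = offset;
        ranges[w].count = base + (w < extra ? 1 : 0);
        offset += ranges[w].count;
    }
    return ranges;
}
```

Separately, and only as a possible lead: Microsoft's C runtime documentation notes that the handle returned by `_beginthread` may already be invalid if the thread exits quickly, and suggests `_beginthreadex` when the handle is going to be waited on.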

##### Share on other sites
I used CreateEvent/SetEvent, and now I no longer get synchronization problems, but still nothing is rendered. I'm trying to set the driver type to reference, but I can't manage to do it. If I pass NULL for the adapter and D3D_DRIVER_TYPE_REFERENCE for the driver type, D3D11CreateDevice succeeds, but then I cannot create a swap chain, and the debug layer reports this warning:
DXGI Warning: IDXGIFactory::CreateSwapChain: This function is being called with a device from a different IDXGIFactory.
So I must pass the adapter. How can I create a reference device then?

It's somewhat disappointing: even though nothing is rendered, I still get the same FPS, maybe a little less. But I will continue to the end and see how it goes...

EDIT:

    D3D11_FEATURE_DATA_THREADING support;
    _D3DDevice->CheckFeatureSupport ( D3D11_FEATURE_THREADING, &support, sizeof D3D11_FEATURE_DATA_THREADING );

This gives me 100% positive results; my hardware has full support for multithreading.

##### Share on other sites

I think this journal post might be helpful for your current situation... although I don't think that has anything to do with threading.

Have you taken a frame capture with PIX yet? That could be helpful for figuring out what is going on. I would recommend that you build your MT support so that you can gracefully switch back to a single-threaded mode that only uses the immediate context. Then you can prove out your code in ST mode first, and switch to using deferred contexts + command lists + immediate context afterwards.
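
A switchable mode can be as simple as one object that either runs commands right away or records them for later. The sketch below is plain C++ with a string log standing in for the GPU; `CommandSink`, `Submit`, `Flush`, and `RenderScene` are made-up names, not Hieroglyph 3 or D3D11 API:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Log = std::vector<std::string>;
using Command = std::function<void(Log&)>;

// Either executes each command immediately (single-threaded mode) or
// records it and replays the whole batch on Flush (deferred mode).
class CommandSink {
public:
    explicit CommandSink(bool deferred) : deferred_(deferred) {}

    void Submit(Command cmd, Log& target) {
        if (deferred_) recorded_.push_back(std::move(cmd));
        else           cmd(target);
    }

    // Replays recorded commands in order; a no-op in immediate mode.
    void Flush(Log& target) {
        for (auto& cmd : recorded_) cmd(target);
        recorded_.clear();
    }

private:
    bool deferred_;
    std::vector<Command> recorded_;
};

// The same scene code runs unchanged in either mode.
Log RenderScene(bool deferred)
{
    Log log;
    CommandSink sink(deferred);
    sink.Submit([](Log& l) { l.push_back("set states"); }, log);
    sink.Submit([](Log& l) { l.push_back("draw terrain"); }, log);
    sink.Flush(log);
    return log;
}
```

If both modes produce the same result, any remaining difference in the real renderer is more likely to come from deferred-context state or threading than from the scene code itself.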

Something else that might be helpful when using PIX with multithreading is to set event markers wherever you start and stop generating a command list. If you still have a copy of Hieroglyph 3 on your machine, you can take a look at RendererDX11::PIXBeginEvent() and RendererDX11::PIXEndEvent() to see how to implement them. It can get pretty hard to see what is going on when multiple threads are dumping code simultaneously, and these will help you make sense of them.