Performance tips

Started by
30 comments, last by Hassanbasil 12 years, 8 months ago
Don't bother with PerfHUD; it doesn't work with D3D11. Get Parallel Nsight instead, which is NVIDIA's replacement for PerfHUD.
As for multithreading in D3D11, it all depends on how things are done in your engine. If you do enough CPU work for each soldier you process, then multithreading could help. By CPU work, I mean state changes and draw calls, plus the matrix multiplication or whatever else you do to set up each rendering operation.

One of the chapter samples from our book (linked below) handles a similar (but not identical) situation to yours: many simple renderings without much GPU work in each one. The test scene uses a few reflective objects that each need an environment map rendered of the whole scene, so the whole scene has to be rendered again for every environment map. I then scale the number of objects from 200 to 1000 and compare single-threaded and multithreaded modes. The multithreaded mode is up to 60% faster than the single-threaded mode on a quad-core machine. Here's an image of the scenario:

[attachment=4829:MirrorMirror.png]


You can take a look at the code for the MirrorMirror demo in the Hieroglyph 3 repository. It probably won't be directly applicable to your engine, but it might be useful to check out.

Also, I'm using fairly heavy shaders. Running with flat color shaders gives me 300-400 FPS at 640x480 windowed, but that is not quite satisfying, as I will need texturing, and I really want per-pixel lighting for some planned effects. That will probably drop me below 60 FPS, even at 640x480, but I will try to pick it apart and might even end up with vertex lighting if it's faster.

You can perform per-pixel lighting with nice effects on nearby soldiers and render distant soldiers using per-vertex lighting (think of it as an effects LOD system).
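As a rough sketch of that effects-LOD idea, the choice can be made per soldier from its distance to the camera. Everything named here (Vec3, LightingLod, selectLightingLod, and the switchDistance tuning value) is hypothetical and only for illustration; a real engine would plug in its own math types and thresholds.

```cpp
#include <cassert>

// Hypothetical types for illustration; not from any engine in this thread.
enum class LightingLod { PerPixel, PerVertex };

struct Vec3 { float x, y, z; };

// Squared distance avoids a square root per soldier.
inline float distanceSquared(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Soldiers within switchDistance of the camera get the expensive per-pixel
// shader; everything farther away falls back to per-vertex lighting.
inline LightingLod selectLightingLod(const Vec3& camera, const Vec3& soldier,
                                     float switchDistance) {
    assert(switchDistance >= 0.0f);
    return distanceSquared(camera, soldier) <= switchDistance * switchDistance
               ? LightingLod::PerPixel
               : LightingLod::PerVertex;
}
```

Sorting the soldiers into two batches by the returned value would let each shader be bound once per frame instead of switching per soldier.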





Oh, looks like I missed a post :huh:
@TiagoCosta: good idea actually, will certainly implement that :)

@MJP: Alright, I already have Parallel Nsight, though I never thought of using it for analysis; I thought it was just a shader debugging tool (I never got shader debugging working - anyway, mods say I need 2 machines... whatever).
It looks like I'm calling Map/Unmap a lot in my own effect system (I didn't like the framework provided by Microsoft; it doesn't look official), so I think I can get a good FPS increase after optimizing it.

@Jason Z: Thanks for sharing that. I took a quick look through your MirrorMirror demo but can't seem to find the multithreading code. Is it deep in the system? Would you mind pointing it out?

@everyone: It also looks like I've got a lot of math calculations going on there, and I don't think I can reduce them. Would a compute shader help in this situation? I have never tried one, but is it worth trying?



You only need 2 machines or 2 GPUs for debugging shaders. API debugging and performance analysis can be done on the host machine.




Yes, you'll definitely want to minimize how much you Map your buffers. For constant buffers in particular, you should try to organize them so that you can minimize the number of times you need to update them in a frame. Typically this is done by separating constants into separate buffers based on frequency of update (such as constants updated once per frame, once per render pass, once per object, etc.).
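A minimal sketch of that grouping, using plain C++ structs as stand-ins for the constant buffer layouts. The struct and member names are made up for illustration, and the byte counts only model how much constant data gets rewritten per frame; they are not a measurement of any real engine. The one hard rule shown is that a D3D11 constant buffer's size must be a multiple of 16 bytes.

```cpp
#include <cstddef>

// Hypothetical stand-ins for float4 / float4x4 shader types.
struct Float4   { float x, y, z, w; };
struct Float4x4 { float m[16]; };

// Constants grouped by how often they change.
struct PerFrameConstants  { Float4x4 viewProjection; Float4 cameraPosition; Float4 fogParams; };
struct PerPassConstants   { Float4x4 lightViewProjection; Float4 passParams; };
struct PerObjectConstants { Float4x4 world; Float4 tint; };

// D3D11 requires constant buffer sizes to be multiples of 16 bytes.
static_assert(sizeof(PerFrameConstants)  % 16 == 0, "pad PerFrameConstants to 16 bytes");
static_assert(sizeof(PerPassConstants)   % 16 == 0, "pad PerPassConstants to 16 bytes");
static_assert(sizeof(PerObjectConstants) % 16 == 0, "pad PerObjectConstants to 16 bytes");

// Bytes rewritten per frame when each group lives in its own buffer:
// the frame data once, the pass data once per pass, the object data per draw.
constexpr std::size_t bytesWrittenSplit(std::size_t passes, std::size_t objectsPerPass) {
    return sizeof(PerFrameConstants)
         + passes * sizeof(PerPassConstants)
         + passes * objectsPerPass * sizeof(PerObjectConstants);
}

// Bytes rewritten per frame when everything sits in one big buffer that is
// refilled for every draw.
constexpr std::size_t bytesWrittenMonolithic(std::size_t passes, std::size_t objectsPerPass) {
    return passes * objectsPerPass
         * (sizeof(PerFrameConstants) + sizeof(PerPassConstants) + sizeof(PerObjectConstants));
}
```

With 2 passes and 1000 objects per pass, this model rewrites 160256 bytes per frame for the split layout versus 512000 bytes for the single big buffer, and the per-frame buffer only needs to be mapped once.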

In fact, I have this system implemented, but there appears to be a hole in it; I'm working on fixing it.

I've also got one more question: is tessellation the right way to handle terrain LOD, if it's even possible?
The multithreading code is integrated at the heart of the RendererDX11 class, and it is controlled by the application calling RendererDX11::SetMultiThreadingState().


Multithreading can also help with mapping buffers, since the work is done on multiple threads. There are a few limitations though - the buffers can't be read from, and depending on how many you are mapping in a single command list, there can be quite a bit of extra memory being carried around by the driver until you execute the command list. Even so, I think it is worth trying out.

About the terrain rendering, I think it all depends on what you are doing and how it has to work... Tessellation isn't free, but if you use it to avoid processing a massive number of vertices every frame, then it is a net win. Can you give more specifics about your desired system?
Well, it's a basic terrain system, nothing special yet, but the terrains should be really huge. I already have a frustum culling system and some fog; it looks good, but around 10-30 sectors will always be drawn. Each one consists of 67x67 = 4489 vertices, so for an average of 20 sectors that's 4489 * 20 = 89780 vertices. I think it's worth reducing?
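For reference, the same arithmetic for the whole 10-30 sector range can be written down as a small compile-time check. The helper names are hypothetical, and the triangle count assumes a regular grid with two triangles per cell, which the post does not actually state.

```cpp
// Hypothetical helpers; the input numbers come from the post above.
constexpr unsigned verticesPerSector(unsigned side) {
    return side * side;
}

// Assumes a regular grid: (side - 1)^2 cells, two triangles per cell.
constexpr unsigned trianglesPerSector(unsigned side) {
    return (side - 1) * (side - 1) * 2;
}

constexpr unsigned totalVertices(unsigned side, unsigned sectors) {
    return verticesPerSector(side) * sectors;
}

static_assert(verticesPerSector(67) == 4489, "67 x 67 grid");
static_assert(totalVertices(67, 10) == 44890, "10 visible sectors");
static_assert(totalVertices(67, 20) == 89780, "20 visible sectors (the average)");
static_assert(totalVertices(67, 30) == 134670, "30 visible sectors");
static_assert(trianglesPerSector(67) == 8712, "triangles in one full-detail sector");
```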

My current LOD system is very poor. It is merely a couple of index buffers, one for each LOD stage; vertex positions don't change, only the triangle count. It gives me fairly good performance, but there are gaps between sectors at different LODs. For example, if one sector has 33 vertices on a side and the one next to it has 17, they simply can't meet without gaps. I was thinking about keeping the outer vertices active while reducing only the inner triangles, but I still can't find a good way to do it.
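One alternative to restitching the index buffers, sketched here only as an assumption and not as what the poster ended up doing, is to move the extra edge vertices of the finer sector onto the line between the coarser neighbour's vertices. That changes vertex heights (unlike the index-only scheme above), but it closes the geometric gap. The function name and parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// edgeHeights: heights along the shared edge of the finer sector (e.g. 33 values).
// step: number of fine intervals per coarse interval (2 when the neighbour
//       has 17 vertices on that edge).
// Every vertex that the coarser neighbour does not have is moved onto the
// straight line between the two coarse vertices around it.
void snapEdgeToCoarserNeighbour(std::vector<float>& edgeHeights, std::size_t step) {
    assert(step >= 1);
    assert(!edgeHeights.empty() && (edgeHeights.size() - 1) % step == 0);

    for (std::size_t base = 0; base + step < edgeHeights.size(); base += step) {
        const float h0 = edgeHeights[base];        // shared with the coarse sector
        const float h1 = edgeHeights[base + step]; // next shared vertex
        for (std::size_t k = 1; k < step; ++k) {
            const float t = static_cast<float>(k) / static_cast<float>(step);
            edgeHeights[base + k] = h0 + (h1 - h0) * t; // linear interpolation
        }
    }
}
```

For the 33 versus 17 case, step is 2, so every odd edge vertex ends up at the midpoint of its two even neighbours.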

Also, I came across an idea yesterday. I've heard that vertex shaders can sample textures, so maybe I could render all terrain sectors at once using instancing, then give the heightmap to the shader and have it offset each vertex by the height at its u,v coordinates in the map. I'm going to try it soon.

EDIT: corrected the vertex count, sorry.
Sampling textures in the vertex shader turned out not to be a good idea: it didn't completely work and gave no performance gain anyway. With or without LOD, though, there is too much work for the CPU (a lot of draw calls), so I will try rendering terrain sectors on multiple threads to see how that works.

@Jason Z: Looks like you're implementing it the same way I was planning to, so I will give it a try.
I'm having a little problem: I get an access violation reading location xxxx when I call ExecuteCommandList on the immediate context. Here is my current code:


// One thread handle, one argument block and one sector count per worker thread
HANDLE handles[HX_TERRAIN_RENDERING_THREADS_COUNT];
RenderSectors* args[HX_TERRAIN_RENDERING_THREADS_COUNT];
UINT sectorCount[HX_TERRAIN_RENDERING_THREADS_COUNT] = { 0 };

// Distribute the visible sectors round-robin across the threads
for ( UINT i = 0; i < SectorsToRender.size ( ); i++ )
    sectorCount[i%HX_TERRAIN_RENDERING_THREADS_COUNT]++;

// Fill in the arguments for each thread and start it
for ( UINT i = 0; i < HX_TERRAIN_RENDERING_THREADS_COUNT; i++ )
{
    args[i] = new RenderSectors ( );
    args[i]->context = _core->GetD3D11DeferredContext ( i );
    args[i]->numSectors = sectorCount[i];
    args[i]->sectors = i == 0 ? &SectorsToRender[0] : &SectorsToRender[sectorCount[i-1]];
    args[i]->indexBuffers = i == 0 ? &indexBuffersToRender[0] : &indexBuffersToRender[sectorCount[i-1]];

    handles[i] = (HANDLE)_beginthread ( __RenderSectors, 0, (void*)args[i] );
}

// Wait for all threads to finish recording their command lists
WaitForMultipleObjects ( HX_TERRAIN_RENDERING_THREADS_COUNT, handles, TRUE, INFINITE );

// Play back each command list on the immediate context, then clean up
for ( UINT i = 0; i < HX_TERRAIN_RENDERING_THREADS_COUNT; i++ )
{
    _core->GetD3D11DeviceContext()->ExecuteCommandList ( args[i]->commandList, FALSE ); //access violation reading location xxxx
    HX_SAFE_FREE ( args[i]->commandList );
    HX_SAFE_DELETE ( args[i] );
}


Basically, I have 2 vectors containing the buffers required for rendering. I create a new RenderSectors object that holds pointers to those buffers, some variables, and a pointer to the deferred context to use, and pass it to the thread so it can render them.
When all threads finish, I call ExecuteCommandList on the immediate context, and this is where the error occurs. _core->GetD3D11DeviceContext() returns a valid pointer to the immediate context, and the commandList stored in args is also valid.

Also, here is my thread code:

static void __RenderSectors ( void* sectors )
{
    RenderSectors* renderSectors = (RenderSectors*)sectors;
    if ( !renderSectors )
        return;

    for ( UINT i = 0; i < renderSectors->numSectors; i++ )
    {
        //set the mesh data in the input assembler to draw the mesh
        UINT Strides[1];
        UINT Offsets[1];
        ID3D11Buffer* pVB[1];
        pVB[0] = renderSectors->sectors[i]->vb;
        Strides[0] = sizeof ( _hxTerrainVertex );
        Offsets[0] = 0;
        renderSectors->context->IASetVertexBuffers ( 0, 1, pVB, Strides, Offsets );
        renderSectors->context->IASetIndexBuffer ( renderSectors->indexBuffers[i]->buffer, DXGI_FORMAT_R32_UINT, 0 );

        //draw the mesh
        renderSectors->context->DrawIndexed ( renderSectors->indexBuffers[i]->size, 0, 0 );
    }

    //record everything issued on this deferred context into a command list
    renderSectors->context->FinishCommandList ( FALSE, &renderSectors->commandList );

    _endthread ( );
}


Looks like I missed a point regarding multithreaded rendering?
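A few things in the posted code could plausibly lead to this, though none of it is a confirmed diagnosis. The handle returned by _beginthread is closed automatically when the thread exits, so waiting on it can fail or return before FinishCommandList has filled in commandList; _beginthreadex returns a handle that stays valid until CloseHandle is called. A deferred context also does not inherit state from the immediate context, so render targets, viewport, shaders and so on have to be set on each deferred context before drawing. Finally, the start offset for thread i uses only sectorCount[i-1], which is only right for the first two threads. The sketch below covers just that last point in portable C++; SectorRange and partitionSectors are hypothetical names, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: a contiguous slice of the sector list for one thread.
struct SectorRange {
    std::size_t offset; // index of the first sector this thread renders
    std::size_t count;  // number of sectors this thread renders
};

// Split totalSectors across threadCount workers. Counts are handed out
// round-robin (as in the posted loop); each offset is the running sum of all
// earlier counts, so the ranges never overlap and cover every sector.
std::vector<SectorRange> partitionSectors(std::size_t totalSectors,
                                          std::size_t threadCount) {
    assert(threadCount > 0);
    std::vector<SectorRange> ranges(threadCount, SectorRange{0, 0});

    for (std::size_t s = 0; s < totalSectors; ++s)
        ranges[s % threadCount].count++;

    std::size_t next = 0;
    for (SectorRange& r : ranges) {
        r.offset = next;
        next += r.count;
    }
    return ranges;
}
```

With 10 sectors and 4 threads this gives offsets 0, 3, 6, 8 and counts 3, 3, 2, 2, whereas using only the previous thread's count would start the third thread at 3 instead of 6.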

