[DX11] Instancing slows down instead of speeding up


Recommended Posts

Hi, I'm rendering 400 objects with the same textures/indices/vertices. Normally that takes 400 draw calls, and it ran at 60+ fps. I figured rendering them with instancing would give quite a speedup. After changing things a bit it now renders the 400 objects with only 2 draw calls, but the fps dropped to 30.

I ran it through GPUPerfStudio and it said my fps was limited by my draw calls (when instancing), which doesn't make a lot of sense to me. How can 400 draw calls be fast and not bottleneck my code, while having only 2 draw calls does bottleneck it? Isn't that what instancing is for, reducing the number of draw calls needed?

I'm instancing by filling a cbuffer with 256 world matrices and sending it to the shader, where SV_InstanceID is used to pick the appropriate world matrix from the cbuffer. The CPU only runs at 40% or so while the app is running, so that doesn't seem to be the bottleneck. I've also tweaked the number of instances rendered at a time (10, 20, 256); all of them are severely slower than just rendering normally.

So here comes the question: how can using instancing slow my app down instead of speeding it up? Am I doing something wrong here, or..?

I believe some setting is causing this (I don't know which one), because I made a renderer that draws ~15k objects at around 100 fps.

Is that with instancing or without it?

By the way, the normal way to do instancing is to have a separate stream with the instance data. Reading the data from a cbuffer is probably more costly.
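Such a per-instance stream is declared through the input layout. A minimal sketch of what that looks like (element names, formats, offsets and the buffer variables here are illustrative, not taken from your code):

// Slot 0 carries ordinary per-vertex data; slot 1 carries one world matrix
// per instance, split into four float4 rows because a single element can't
// hold a full matrix. The step rate of 1 advances the data once per instance.
D3D11_INPUT_ELEMENT_DESC layout[] =
{
    { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT,    0,  0, D3D11_INPUT_PER_VERTEX_DATA,   0 },
    { "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT,       0, 12, D3D11_INPUT_PER_VERTEX_DATA,   0 },
    { "WORLD",    0, DXGI_FORMAT_R32G32B32A32_FLOAT, 1,  0, D3D11_INPUT_PER_INSTANCE_DATA, 1 },
    { "WORLD",    1, DXGI_FORMAT_R32G32B32A32_FLOAT, 1, 16, D3D11_INPUT_PER_INSTANCE_DATA, 1 },
    { "WORLD",    2, DXGI_FORMAT_R32G32B32A32_FLOAT, 1, 32, D3D11_INPUT_PER_INSTANCE_DATA, 1 },
    { "WORLD",    3, DXGI_FORMAT_R32G32B32A32_FLOAT, 1, 48, D3D11_INPUT_PER_INSTANCE_DATA, 1 },
};

// Bind the mesh buffer in slot 0 and the instance buffer in slot 1.
ID3D11Buffer* vbs[2] = { meshVB, instanceVB }; // hypothetical buffers
UINT strides[2]      = { sizeof(Vertex), 64 }; // vertex size, 4x4 float matrix
UINT offsets[2]      = { 0, 0 };
context->IASetVertexBuffers(0, 2, vbs, strides, offsets);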

Have you tried storing the matrices in a texture buffer instead? I don't know if this causes design problems on your end, but it might be interesting to look at.

I too find it strange that you get these results; instancing should indeed decrease the workload and increase the framerate in the way you describe.

There are indeed some performance issues to take into account regarding cbuffers, but none should be this dramatic to the end result.

It might be helpful if you could provide some (pseudo-)code of your initialization and rendering procedures.
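For reference, the texture-buffer variant of the matrix fetch might look something like this on the shader side (a sketch only; the buffer and function names are made up for illustration):

// A buffer SRV holding 4 float4 rows per instance, created with
// D3D11_BIND_SHADER_RESOURCE and bound via VSSetShaderResources
// instead of a cbuffer.
Buffer<float4> InstanceMatrices : register(t0);

float4x4 LoadWorld(uint instanceID)
{
    uint base = instanceID * 4; // 4 rows per matrix
    return float4x4(InstanceMatrices.Load(base + 0),
                    InstanceMatrices.Load(base + 1),
                    InstanceMatrices.Load(base + 2),
                    InstanceMatrices.Load(base + 3));
}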

Maybe instancing isn't supported by the GFX card, so the driver is doing it in software, without hardware acceleration?

I'll have a look at doing it the 'normal' way, with a separate stream. The NVIDIA "SkinnedInstancing" demo does it using a huge cbuffer, so I figured that was a fast way to do it.

I don't think it would be much work to see what happens if I try it with a texture, but isn't writing to a texture on the GPU a lot slower than working with a cbuffer (which is meant to be written to)? I guess it would make an interesting test either way.

Here's some code:

Cbuffer creation:

// Default-usage cbuffer, updated from the CPU via UpdateSubresource below.
D3D11_BUFFER_DESC gpuBufDesc;
gpuBufDesc.Usage = D3D11_USAGE_DEFAULT;
gpuBufDesc.ByteWidth = desc.Size;
gpuBufDesc.CPUAccessFlags = 0;
gpuBufDesc.BindFlags = D3D11_BIND_CONSTANT_BUFFER;
gpuBufDesc.MiscFlags = 0;
gpuBufDesc.StructureByteStride = 0;

dev->CreateBuffer(&gpuBufDesc, nullptr, &gpuBuffer);

Updating the cbuffer with new data:

context->UpdateSubresource(gpuBuffer, 0, nullptr, memoryBuffer->getBuffer(), memoryBuffer->getSize(), 0);

And here's the draw:

// Instanced path: one call draws instanceCount copies.
context->DrawIndexedInstanced(mat.getMeshBuffer()->getIndexCount(), instanceCount, 0, 0, 0);
// Non-instanced path (one call per object), used for comparison:
context->DrawIndexed(mat.getMeshBuffer()->getIndexCount(), 0, 0);

Some of the HLSL:

struct InstanceStruct
{
    matrix World : World;
};

cbuffer PerInstanceCB
{
    InstanceStruct InstanceData[MAX_INSTANCE_CONSTANTS] : InstanceData;
};

// In the vertex shader, SV_InstanceID (input.IID) selects this instance's matrix:
output.Pos = mul(input.Pos, InstanceData[input.IID].World);
output.Pos = mul(output.Pos, View);
output.Pos = mul(output.Pos, Projection);

(And yes, I know it's faster to build a combined ViewProjection matrix and multiply with that instead. :))

I have an HD4850 with the latest Catalyst drivers, so I believe it is supported. It may be, however, that the DX11 drivers don't quite support it properly yet. But wouldn't my CPU usage be skyrocketing in that case?

Any program using DirectX with a standard game loop will use close to 100% of one CPU core, unless you've written extra code to change that.

The reason is that if the GPU gets ahead, the CPU busy-waits for it, so that there is minimal delay between the GPU becoming ready for more data and your code getting run again.
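If you want CPU usage to reflect actual work instead of that busy-wait, one option (my suggestion, not something discussed above) is to cap how far the CPU may run ahead through DXGI:

// Sketch: limit how many frames the CPU may queue ahead of the GPU,
// shrinking the busy-wait window. 'device' is the ID3D11Device.
IDXGIDevice1* dxgiDevice = nullptr;
if (SUCCEEDED(device->QueryInterface(__uuidof(IDXGIDevice1),
                                     reinterpret_cast<void**>(&dxgiDevice))))
{
    dxgiDevice->SetMaximumFrameLatency(1); // default is 3
    dxgiDevice->Release();
}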

So first, something interesting has been pointed out across the thread: you are taking code from an NVIDIA demo, running it on an ATI card, and expecting similar results. The posters before me who suspect the cbuffer are probably pointing you in the right direction. It's not uncommon for these companies to post demo code that runs well on their own hardware and poorly on the other guy's, and the NVIDIA demo doesn't rule out other, possibly even faster, ways of doing the same work. You'll have to test on both sets of hardware to see what works best between them, or else write two shaders, one for each IHV (which is a fairly normal thing to have to do if performance is important).

First, a note on transferring instance data to the GPU: there really shouldn't be any difference in transfer speed between a cbuffer and a texture. Both require writing data in blocks and doing a DMA transfer; there isn't anything interesting there about getting data from the CPU to the GPU. When transferring lots of data, you might want to consider a dynamic buffer. A dynamic buffer gives the driver more flexibility in scheduling the transfer, and in this case it also lets you send a variable amount of data depending on the number of instances you want to draw each frame. A cbuffer forces you to send the full amount of data your HLSL declared every time, so you always pay the maximum cost even if half of the data ends up being zeros.
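A dynamic version of the cbuffer creation/update shown earlier might look like this (a sketch, assuming you keep the same cbuffer layout; variable names are illustrative):

// Create the buffer with dynamic usage and CPU write access...
D3D11_BUFFER_DESC desc = {};
desc.Usage          = D3D11_USAGE_DYNAMIC;
desc.ByteWidth      = 64 * MAX_INSTANCE_CONSTANTS; // one 4x4 float matrix per instance
desc.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
dev->CreateBuffer(&desc, nullptr, &gpuBuffer);

// ...then update it each frame with Map/WRITE_DISCARD, writing only the
// matrices this frame actually needs.
D3D11_MAPPED_SUBRESOURCE mapped;
context->Map(gpuBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
memcpy(mapped.pData, worldMatrices, 64 * instanceCount);
context->Unmap(gpuBuffer, 0);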

So why might cbuffers be a problem? A cbuffer has a very different cache structure from a texture: it is optimized for in-order access of constants, while a tbuffer is optimized for random access with locality. It's possible that every instance starts off with a cache miss on the cbuffer, since your indexing pattern may be quite different from what the driver/hardware expects for maximum performance (the driver might be preloading data based on that expectation), but you'd have to code up a different solution to find out. Tbuffers may give better cache hits and would allow partial updates, so they could perform better overall. However, I think you should also try an instanced dynamic vertex buffer for your data, since the input assembler was designed to optimize exactly this scenario. That said, there are occasional reports of texture-based model attributes outperforming the input assembler, though those reports concern textures or tbuffers, not cbuffers, and may depend on whether the mesh data is optimized for vertex caching; the number of attributes in the vertex data also affects vertex-cache utilization on some hardware.

Since there are so many variables here, it can be hard to figure out the right combination for the best performance, so you may just have to try quite a few things. Be careful about expecting any one technique to work well on all cards, especially across IHVs, because this often isn't the case.

Well, I expected the cbuffers to work well because the NVIDIA instancing demo runs really fast here, but I guess it renders less than I am, and it does more stuff than just rendering plain meshes.

I just stumbled across this piece of info in NVIDIA's "A to Z of DX Performance" presentation:

Instance data:
ATI: Ideally should come from additional streams (up to 32 with DX10.1)
NVIDIA: Ideally should come from CB indexing

So I guess 'CB indexing', which is what I'm doing, is faster on NVIDIA cards than on ATI cards, but I didn't expect THIS much of a performance decrease. I'll add instancing streams to my engine for ATI and see whether it works better (if I figure out how, anyway).
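On the shader side, the stream-based version would replace the SV_InstanceID cbuffer lookup with matrix rows arriving through the vertex input; a minimal sketch (the struct, function and semantic names are hypothetical; View/Projection as in my current shader):

// The world matrix arrives as four float4 rows per instance via the
// input assembler, matching WORLD0..WORLD3 elements in the input layout.
struct VS_INPUT
{
    float4 Pos    : POSITION;
    float4 World0 : WORLD0;
    float4 World1 : WORLD1;
    float4 World2 : WORLD2;
    float4 World3 : WORLD3;
};

float4 TransformToClip(VS_INPUT input)
{
    float4x4 world = float4x4(input.World0, input.World1,
                              input.World2, input.World3);
    float4 pos = mul(input.Pos, world);
    pos = mul(pos, View);
    pos = mul(pos, Projection);
    return pos;
}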
