mlfarrell00

Member
  1. mlfarrell00

    D3D12 atrocious performance

    wow.. so get this.. I've been doing this work on a Mac rebooted into Windows 10 via Boot Camp (not a VM, actually running Windows on Mac hardware).  The same exact OpenGL demo, when rebooted into OS X, gets 20-23 frames per second.  Confirms what everyone already knows: Apple gives zero fucks about optimizing their OpenGL drivers.  I knew NVIDIA had an edge on Apple's drivers, but this is ridiculous.  One of the reasons I scouted ahead to learn D3D12 was to get an edge on the concepts that will likely be available when Vulkan releases, but with Apple touting their vendor-lock-in Metal API, it's a wonder if we'll ever see Vulkan on OS X at all.  Food for thought, I guess.  Either way, if Metal becomes the ONLY way to get performant 3D on OS X, I'll be abandoning it in favor of Windows in a heartbeat.
  2. mlfarrell00

    D3D12 atrocious performance

    Bam!  I finally beat OpenGL.  Man, the NVIDIA developers of the OpenGL driver on Windows are on point, that's all I'm gonna say, 'cause this was a bitch.  Even after latency fixes and large heap allocations, I had to do tons of CPU-bound optimizations to thin out my VGL layer as much as possible.  Things like STL containers were big bottlenecks.  Things I already knew, but don't think about until optimization needs hit a certain level.

    On OpenGL, rendering 5000 unique objects via 5000 separate draw calls amounts to about 48-52 FPS.  Now, on my D3D12 backend, I'm achieving up to 57 FPS.  I finally beat it!  There's likely even more room for optimization on the D3D12 side, so I'm happy.

    What I'm doing, basically, is adding a D3D12 renderer backend to my graphics engine, which allows me to open scene files that I made with my app (http://vertostudio.com).  The scenes were created using an OpenGL ES variant of the same engine.  Being able to load them up in a D3D12 environment with good performance is awesome.  In fact, the very same C++ graphics engine has been built into JS (via emscripten) and runs on that same website inside of the cloud viewer.

    This whole experiment was about the continuation of my goal to make my graphics engine as platform-independent and versatile as possible - by offering different rendering backends for different systems.  Now I've got OpenGL 3, WebGL, and D3D12, and soon Vulkan.

    Here's my "final" core state class below for reference.

    #include "vgl.h"
    #include "System.h"
    #include "CoreStateMachine.h"
    #include "BufferArray.h"
    #include "FrameBuffer.h"
    #include "VertexArray.h"
    #include "Texture.h"

    using namespace std;
    using namespace Microsoft::WRL;

    namespace vgl
    {
      static const int GPUDescriptorHeapSize = 2048;
      static const int GlobalCBufferMaxSize = 4096;
      static const int GlobalCBufferMaxCalls = 10000;
      static const int TriangleFanEmulationBufferSize = 1000000 * 4;

      CoreStateMachine::CoreStateMachine()
      {
        //init cannot be done until we have the device & queue pointers
      }

      CoreStateMachine::~CoreStateMachine()
      {
      }

      void CoreStateMachine::shutdown()
      {
        //must be done BEFORE destruction
        for(size_t i = 0; i < globalConstantBuffers.size(); i++)
        {
          if(globalConstantBuffers[i])
          {
            globalConstantBuffers[i]->buffers[0]->Unmap(0, nullptr);
          }
        }
      }

      void CoreStateMachine::performDeferredInit()
      {
        D3D12_FEATURE_DATA_D3D12_OPTIONS featureOps;
        ThrowIfFailed(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS, &featureOps, sizeof(featureOps)));

        if((int)featureOps.ResourceBindingTier < (int)D3D12_RESOURCE_BINDING_TIER_2)
        {
          MessageBox(NULL, L"D3D12 Resource Tier 2 support required and not found!", L"It's over", MB_ICONERROR | MB_OK);
          throw vgl_runtime_error("It's over");
        }

        ThrowIfFailed(device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT, IID_PPV_ARGS(&setupCommandAllocator)));
        for(int i = 0; i < MaxLatencyFrames; i++)
          ThrowIfFailed(device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT, IID_PPV_ARGS(&renderCommandAllocator[i])));

        CD3DX12_DESCRIPTOR_RANGE descRange1, descRange2, descRange3;
        descRange1.Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, MaxConstantBuffers, 1);
        descRange2.Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, D3D12_DESCRIPTOR_RANGE_OFFSET_APPEND, 0);
        descRange3.Init(D3D12_DESCRIPTOR_RANGE_TYPE_SAMPLER, D3D12_DESCRIPTOR_RANGE_OFFSET_APPEND, 0);

        //4 or so constant buffers & 8 textures possible per draw call
        textureTableSize = MaxConstantBuffers + MaxTextures;

        CD3DX12_ROOT_PARAMETER rootParam[3];
        CD3DX12_DESCRIPTOR_RANGE ranges[2] = { descRange1, descRange2 };
        rootParam[0].InitAsDescriptorTable(2, ranges);
        rootParam[1].InitAsDescriptorTable(1, &descRange3);
        rootParam[2].InitAsConstantBufferView(0);

        CD3DX12_ROOT_SIGNATURE_DESC rootSignatureDesc;
        rootSignatureDesc.Init(3, rootParam, 0, nullptr, D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT);

        ComPtr<ID3DBlob> signature;
        ComPtr<ID3DBlob> error;
        ThrowIfFailed(D3D12SerializeRootSignature(&rootSignatureDesc, D3D_ROOT_SIGNATURE_VERSION_1, &signature, &error));
        ThrowIfFailed(device->CreateRootSignature(0, signature->GetBufferPointer(), signature->GetBufferSize(), IID_PPV_ARGS(&rootSignature)));

        // Describe and create the graphics pipeline state object (PSO).
        psoDesc = {};
        psoDesc.pRootSignature = rootSignature.Get();
        psoDesc.RasterizerState = CD3DX12_RASTERIZER_DESC(D3D12_DEFAULT);
        psoDesc.RasterizerState.CullMode = D3D12_CULL_MODE_NONE;
        psoDesc.BlendState = CD3DX12_BLEND_DESC(D3D12_DEFAULT);
        psoDesc.DepthStencilState.DepthEnable = FALSE;
        psoDesc.DepthStencilState.DepthWriteMask = D3D12_DEPTH_WRITE_MASK_ALL;
        psoDesc.DepthStencilState.StencilEnable = FALSE;
        psoDesc.DepthStencilState.DepthFunc = D3D12_COMPARISON_FUNC_LESS;
        psoDesc.SampleMask = UINT_MAX;
        psoDesc.PrimitiveTopologyType = D3D12_PRIMITIVE_TOPOLOGY_TYPE_TRIANGLE;
        psoDesc.NumRenderTargets = 1;
        psoDesc.RTVFormats[0] = DXGI_FORMAT_R8G8B8A8_UNORM;
        psoDesc.DSVFormat = DXGI_FORMAT_D32_FLOAT;
        psoDesc.SampleDesc.Count = 1;
        psoDirty = true;

        setupFenceValue = 0;
        ThrowIfFailed(device->CreateFence(setupFenceValue, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&setupFence)));
        renderFenceValue = 0;
        ThrowIfFailed(device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&renderFence)));
        currentFrameIndex = 0;
        ThrowIfFailed(device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&frameFence)));

        // Create an event handle to use for frame synchronization.
        setupFenceEvent = CreateEventEx(nullptr, FALSE, FALSE, EVENT_ALL_ACCESS);
        if(setupFenceEvent == nullptr)
        {
          ThrowIfFailed(HRESULT_FROM_WIN32(GetLastError()));
        }
        renderFenceEvent = CreateEventEx(nullptr, FALSE, FALSE, EVENT_ALL_ACCESS);
        if(renderFenceEvent == nullptr)
        {
          ThrowIfFailed(HRESULT_FROM_WIN32(GetLastError()));
        }
        frameFenceEvent = CreateEventEx(nullptr, FALSE, FALSE, EVENT_ALL_ACCESS);
        if(frameFenceEvent == nullptr)
        {
          ThrowIfFailed(HRESULT_FROM_WIN32(GetLastError()));
        }

        auto cl = beginRenderingCommands();
        triangleFanEBOs = make_shared<BufferArray>(2);
        ushort3 eboData[2] = { { 0, 1, 2 }, { 0, 2, 3 } };
        triangleFanEBOs->provideData(0, sizeof(ushort3) * 2, eboData, BufferArray::UT_STATIC);
        triangleFanEBOs->provideData(1, TriangleFanEmulationBufferSize, nullptr, BufferArray::UT_FORCE_UPLOAD_HEAP);
        triangleFanEBOs->setInternalUsage(true);
        endRenderingCommands(cl);

        cbSrvHeap = make_shared<DescriptorHeap>(device.Get(), D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV, GPUDescriptorHeapSize, true);
        cpuCbSrvHeap = make_shared<DescriptorHeap>(device.Get(), D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV, textureTableSize, false);
        cbSrvHeaps = { cbSrvHeap };
        samplerHeap = make_shared<DescriptorHeap>(device.Get(), D3D12_DESCRIPTOR_HEAP_TYPE_SAMPLER, GPUDescriptorHeapSize, true);
        cpuSamplerHeap = make_shared<DescriptorHeap>(device.Get(), D3D12_DESCRIPTOR_HEAP_TYPE_SAMPLER, textureTableSize, false);
        samplerHeaps = { samplerHeap };
        vom::Texture::setHeaps(cbSrvHeap, samplerHeap);

        waitForSetupCommands();
        waitForRender();
      }

      void CoreStateMachine::setBlendFuncSourceFactor(BlendFactor srcFactor, BlendFactor dstFactor)
      {
        D3D12_BLEND blendFactors[] = { D3D12_BLEND_ONE, D3D12_BLEND_SRC_ALPHA, D3D12_BLEND_INV_SRC_ALPHA };
        auto bs = psoDesc.BlendState.RenderTarget[0];

        if(psoDesc.BlendState.RenderTarget[0].SrcBlend != blendFactors[(int)srcFactor] ||
           psoDesc.BlendState.RenderTarget[0].DestBlend != blendFactors[(int)dstFactor])
        {
          psoDesc.BlendState.RenderTarget[0].SrcBlend = blendFactors[(int)srcFactor];
          psoDesc.BlendState.RenderTarget[0].SrcBlendAlpha = blendFactors[(int)srcFactor];
          psoDesc.BlendState.RenderTarget[0].DestBlend = blendFactors[(int)dstFactor];
          psoDesc.BlendState.RenderTarget[0].DestBlendAlpha = blendFactors[(int)dstFactor];
          psoDirty = true;
        }
      }

      void CoreStateMachine::setInputLayout(const std::vector<D3D12_INPUT_ELEMENT_DESC> &descs)
      {
        bool changed = psoInputLayout.size() != descs.size();

        if(!changed)
        {
          for(int i = 0; i < descs.size(); i++)
          {
            auto &da = descs[i];
            auto &db = psoInputLayout[i];

            //fuck wasting more CPU time right now, leaving out the semantic name string comparison
            /*if(da.AlignedByteOffset != db.AlignedByteOffset || da.Format != db.Format ||
               da.InstanceDataStepRate != db.InstanceDataStepRate || da.InputSlot != db.InputSlot ||
               da.InputSlotClass != db.InputSlotClass || da.SemanticIndex != db.SemanticIndex ||
               (string)da.SemanticName != (string)db.SemanticName)*/
            if(da.AlignedByteOffset != db.AlignedByteOffset || da.Format != db.Format ||
               da.InstanceDataStepRate != db.InstanceDataStepRate || da.InputSlot != db.InputSlot ||
               da.InputSlotClass != db.InputSlotClass || da.SemanticIndex != db.SemanticIndex)
            {
              changed = true;
            }
          }
        }

        if(changed)
        {
          psoInputLayout = descs;
          psoDesc.InputLayout = { psoInputLayout.data(), (UINT)psoInputLayout.size() };
          psoDirty = true;
        }
      }

      void CoreStateMachine::setShaders(ShaderEffect::ShaderProgram *shaders)
      {
        if(currentProgram != shaders)
        {
          currentProgram = shaders;
          if(shaders)
          {
            psoDesc.VS = { reinterpret_cast<UINT8*>(shaders->vertexShader.blob->GetBufferPointer()),
                           shaders->vertexShader.blob->GetBufferSize() };
            psoDesc.PS = { reinterpret_cast<UINT8*>(shaders->pixelShader.blob->GetBufferPointer()),
                           shaders->pixelShader.blob->GetBufferSize() };
          }
          psoDesc.CachedPSO = {};
          psoDirty = true;
        }
      }

      void CoreStateMachine::commitPipelineStateChanges()
      {
        auto cl = continueRenderingCommands();
        ThrowIfFailed(device->CreateGraphicsPipelineState(&psoDesc, IID_PPV_ARGS(&pipelineState)));
        cl->SetPipelineState(pipelineState.Get());
      }

      void CoreStateMachine::enableDepthTesting(bool b)
      {
        if(psoDesc.DepthStencilState.DepthEnable != b)
        {
          psoDesc.DepthStencilState.DepthEnable = b;
          psoDirty = true;
        }
      }

      void CoreStateMachine::setViewport(int x, int y, int w, int h)
      {
        auto cl = continueRenderingCommands();
        bool close = false;
        if(!cl)
        {
          cl = beginRenderingCommands();
          close = true;
        }

        D3D12_VIEWPORT vp = { x, y, w, h, 0, 1 };
        D3D12_RECT scissor = { x, y, w, h }; //not 100% on this scissor rect, later on obtain from current FB
        viewport = { x, y, w, h };
        cl->RSSetViewports(1, &vp);
        cl->RSSetScissorRects(1, &scissor);

        if(close)
        {
          endRenderingCommands(cl);
          waitForRender();
        }
      }

      int4 CoreStateMachine::getViewport()
      {
        return viewport;
      }

      void CoreStateMachine::setColorMask(bool r, bool g, bool b, bool a) { }
      void CoreStateMachine::setDepthMask(bool mask) { }

      void CoreStateMachine::setCullFace(bool cullFace)
      {
        auto cf = cullFace ? D3D12_CULL_MODE_BACK : D3D12_CULL_MODE_NONE;
        if(psoDesc.RasterizerState.CullMode != cf)
        {
          psoDesc.RasterizerState.CullMode = cf;
          psoDirty = true;
        }
      }

      bool CoreStateMachine::enableBlending(bool b)
      {
        if(b != blendingOn)
        {
          if(psoDesc.BlendState.RenderTarget[0].BlendEnable != b)
          {
            psoDesc.BlendState.IndependentBlendEnable = FALSE;
            psoDesc.BlendState.RenderTarget[0].BlendEnable = b;
            blendingOn = b;
            psoDirty = true;
          }

          //state was changed
          return true;
        }

        return false;
      }

      void CoreStateMachine::drawIndexedPrimitives(PrimitiveType type, size_t count, IndexFormat format, size_t bufferOffsetInBytes)
      {
        if(!count)
          return;

        static const D3D12_PRIMITIVE_TOPOLOGY mode[] = {
          D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST,
          D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST, //fan
          D3D_PRIMITIVE_TOPOLOGY_LINELIST,
          D3D_PRIMITIVE_TOPOLOGY_POINTLIST
        };
        auto cl = continueRenderingCommands();
        UINT startIndexLocation = 0;
        INT baseVertexLocation = 0;

        if(bufferOffsetInBytes)
        {
          size_t indSz = 0;
          if(format == IF_USHORT)
          {
            indSz = sizeof(unsigned short);
          }
          else
          {
            indSz = sizeof(unsigned int);
          }
          startIndexLocation = bufferOffsetInBytes / indSz;
        }

        if(type == PT_TRIANGLE_FAN)
        {
          //this is a nightmare, and anyone stupid enough to use it for performance-critical applications deserves
          //this performance penalty
          assert(count == 4);

          auto currentIndexDataCPU = BufferArray::getCurrentIndexBufferUpload();
          UINT8 *data = nullptr;
          UINT fanInds[6];

          ThrowIfFailed(currentIndexDataCPU->Map(0, nullptr, reinterpret_cast<void **>(&data)));
          if(format == IF_USHORT)
          {
            for(int i = 0; i < 4; i++)
              fanInds[i] = ((USHORT *)data)[startIndexLocation+i];
          }
          else
          {
            for(int i = 0; i < 4; i++)
              fanInds[i] = ((UINT *)data)[startIndexLocation+i];
          }
          currentIndexDataCPU->Unmap(0, nullptr);

          //0,1,2, 0,2,3
          D3D12_RANGE range = { fanEboOffset*sizeof(uint), fanEboOffset*sizeof(uint) + sizeof(uint) * 6 };
          fanInds[5] = fanInds[3];
          fanInds[3] = fanInds[0];
          fanInds[4] = fanInds[2];
          ThrowIfFailed(triangleFanEBOs->buffers[1]->Map(0, nullptr, reinterpret_cast<void **>(&data)));
          memcpy(data+(fanEboOffset*sizeof(uint)), fanInds, sizeof(uint3) * 2);
          triangleFanEBOs->buffers[1]->Unmap(0, &range);

          if(range.End >= TriangleFanEmulationBufferSize)
          {
            throw vgl_runtime_error("You're drawing way too many emulated triangle-fan quads in one frame which is EXTREMELY inefficient anyway");
          }

          if(fanEbosSet != 2)
          {
            triangleFanEBOs->setAsIndexBuffer(1, false);
            fanEbosSet = 2;
          }

          count = 6;
          startIndexLocation = fanEboOffset;
          baseVertexLocation = 0;
          fanEboOffset += 6;
          format = IF_UINT;
        }

        cl->IASetPrimitiveTopology(mode[(int)type]);
        cl->DrawIndexedInstanced(count, 1, startIndexLocation, baseVertexLocation, 0);
        drawCallIndex++;

        if(shouldAdvanceOnDraw)
        {
          advanceTextureTable();
          shouldAdvanceOnDraw = false;
        }

        /*descriptorTablesChanged = true;
        descriptorHeapsChanged = true;
        endRenderingCommands(cl);
        beginRenderingCommands();*/
      }

      void CoreStateMachine::drawPrimitiveArray(PrimitiveType type, size_t count, size_t offsetInElements)
      {
        static const D3D12_PRIMITIVE_TOPOLOGY mode[] = {
          D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST,
          D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST, //fan
          D3D_PRIMITIVE_TOPOLOGY_LINELIST,
          D3D_PRIMITIVE_TOPOLOGY_POINTLIST
        };
        auto cl = continueRenderingCommands();

        if(type == PT_TRIANGLE_FAN)
        {
          assert(count == 4);

          if(fanEbosSet != 1)
          {
            triangleFanEBOs->setAsIndexBuffer(0, true);
            fanEbosSet = 1;
          }
          drawIndexedPrimitives(PT_TRIANGLES, 6, IF_USHORT, 0);
          return;
        }

        cl->IASetPrimitiveTopology(mode[(int)type]);
        cl->DrawInstanced(count, 1, offsetInElements, 0);
        drawCallIndex++;

        if(shouldAdvanceOnDraw)
        {
          advanceTextureTable();
          shouldAdvanceOnDraw = false;
        }

        /*endRenderingCommands(cl);
        ThrowIfFailed(commandQueue->Signal(drawCallFence.Get(), drawCallFenceValue + 1));
        drawCallFenceValue++;
        beginRenderingCommands();*/
      }

      void CoreStateMachine::beginFrame()
      {
        currentFrameCount++;
        currentFrameIndex = currentFrameCount % MaxLatencyFrames;

        //wait until we've caught up latency-wise
        if(currentFrameCount > MaxLatencyFrames)
        {
          frameFence->SetEventOnCompletion(currentFrameCount - MaxLatencyFrames, frameFenceEvent);
          DWORD wait = WaitForSingleObject(frameFenceEvent, 10000);
          if(wait != WAIT_OBJECT_0)
            throw vgl_runtime_error("Failed WaitForSingleObject(). Pipeline froze up.");

          ThrowIfFailed(renderCommandAllocator[currentFrameIndex]->Reset());

          //drop any resources needed by completed frame
          renderNeededResources[currentFrameIndex].clear();
        }

        auto fb = dynamic_pointer_cast<vom::FrameBuffer>(vom::FrameBuffer::getScreen());
        auto cl = beginRenderingCommands();

        //reset some per-frame values and incrementers
        descriptorHeapIndex = 0;
        textureTableIndex = 0;
        drawCallIndex = 0;
        shouldAdvanceOnDraw = false;
        fanEbosSet = 0;
        fanEboOffset = 0;
        descriptorTablesChanged = descriptorHeapsChanged = true;
        clPsoDirty = true;

        fb->prepareForDraw();
      }

      void CoreStateMachine::endFrame()
      {
        auto fb = dynamic_pointer_cast<vom::FrameBuffer>(vom::FrameBuffer::getScreen());
        fb->prepareForPresent();
        endRenderingCommands(defaultRenderCommandList);
      }

      void CoreStateMachine::beginSetup()
      {
        //seems fine to use the render queue for this too...
        beginRenderingCommands();
      }

      void CoreStateMachine::endSetup(bool wait)
      {
        endRenderingCommands(defaultRenderCommandList);
        if(wait)
          waitForRender();
      }

      ComPtr<ID3D12GraphicsCommandList> CoreStateMachine::beginSetupCommands()
      {
        ComPtr<ID3D12GraphicsCommandList> commandList;
        ThrowIfFailed(device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT, setupCommandAllocator.Get(), nullptr, IID_PPV_ARGS(&commandList)));
        setupCommandLists.push_back(commandList);
        activeSetupCommandLists.push(commandList);
        return commandList;
      }

      ComPtr<ID3D12GraphicsCommandList> CoreStateMachine::continueSetupCommands()
      {
        return activeSetupCommandLists.top();
      }

      void CoreStateMachine::endSetupCommands(ComPtr<ID3D12GraphicsCommandList> commandList)
      {
        // Execute the outermost command list.
        /*vector<ID3D12CommandList *> ppSetupCommandLists(setupCommandLists.size());
        for(int i = 0; i < setupCommandLists.size(); i++)
        {
          auto &cl = setupCommandLists[i];
          ppSetupCommandLists[i] = cl.Get();
        }
        setupCommandQueue->ExecuteCommandLists(setupCommandLists.size(), ppSetupCommandLists.data());*/
        commandList->Close();
        ID3D12CommandList *ppCommandLists[] = { commandList.Get() };
        commandQueue->ExecuteCommandLists(_countof(ppCommandLists), ppCommandLists);
        activeSetupCommandLists.pop();
      }

      void CoreStateMachine::waitForSetupCommands()
      {
        // Signal and increment the fence value.
        const UINT64 fence = setupFenceValue;
        ThrowIfFailed(commandQueue->Signal(setupFence.Get(), fence));
        setupFenceValue++;

        // Wait until the previous frame is finished.
        if(setupFence->GetCompletedValue() < fence)
        {
          ThrowIfFailed(setupFence->SetEventOnCompletion(fence, setupFenceEvent));
          WaitForSingleObject(setupFenceEvent, INFINITE);
        }

        setupCommandLists.clear();
        while(!activeSetupCommandLists.empty())
          activeSetupCommandLists.pop();
        ThrowIfFailed(setupCommandAllocator->Reset());
      }

      ComPtr<ID3D12GraphicsCommandList> CoreStateMachine::beginRenderingCommands()
      {
        if(!defaultRenderCommandList)
        {
          ThrowIfFailed(device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT, renderCommandAllocator[currentFrameIndex].Get(), nullptr, IID_PPV_ARGS(&defaultRenderCommandList)));
        }
        else
        {
          ThrowIfFailed(defaultRenderCommandList->Reset(renderCommandAllocator[currentFrameIndex].Get(), nullptr));
        }

        defaultRenderCommandListAvailable = true;
        renderFenceValue++;
        return defaultRenderCommandList;
      }

      ComPtr<ID3D12GraphicsCommandList> CoreStateMachine::continueRenderingCommands()
      {
        if(!defaultRenderCommandListAvailable)
          return nullptr;
        return defaultRenderCommandList;
      }

      void CoreStateMachine::endRenderingCommands(ComPtr<ID3D12GraphicsCommandList> commandList)
      {
        ThrowIfFailed(commandList->Close());
        ID3D12CommandList *ppCommandLists[] = { commandList.Get() };
        commandQueue->ExecuteCommandLists(_countof(ppCommandLists), ppCommandLists);
        defaultRenderCommandListAvailable = false;
        ThrowIfFailed(commandQueue->Signal(renderFence.Get(), renderFenceValue));
        ThrowIfFailed(commandQueue->Signal(frameFence.Get(), currentFrameCount));
      }

      void CoreStateMachine::waitForRender()
      {
        // Signal and increment the fence value.
        const UINT64 fence = renderFenceValue;
        ThrowIfFailed(commandQueue->Signal(renderFence.Get(), fence));

        // Wait until the previous frame is finished.
        if(renderFence->GetCompletedValue() < fence)
        {
          ThrowIfFailed(renderFence->SetEventOnCompletion(fence, renderFenceEvent));
          WaitForSingleObject(renderFenceEvent, INFINITE);
        }

        for(int i = 0; i < MaxLatencyFrames; i++)
        {
          renderNeededResources[i].clear();
          ThrowIfFailed(renderCommandAllocator[i]->Reset());
        }

        descriptorHeapIndex = 0;
        textureTableIndex = 0;
        drawCallIndex = 0;
        shouldAdvanceOnDraw = false;
        fanEbosSet = 0;
        fanEboOffset = 0;
        descriptorTablesChanged = descriptorHeapsChanged = true;
      }

      void CoreStateMachine::preserveResourceUntilRenderComplete(ComPtr<ID3D12Pageable> resource)
      {
        renderNeededResources[currentFrameIndex].push_back(resource);
      }

      void CoreStateMachine::presentAndSwapBuffers(bool waitForFrame)
      {
        auto fb = dynamic_pointer_cast<vom::FrameBuffer>(vom::FrameBuffer::getScreen());
        fb->getSwapChain()->Present(1, 0);

        if(waitForFrame)
        {
          //waitForRender();
          fb->updateSwapFrameIndex();
        }
      }

      void CoreStateMachine::advanceTextureTable()
      {
        textureTableIndex += textureTableSize;
        descriptorTablesChanged = true;

        //check if ran out of descriptor table space
        if(textureTableIndex + textureTableSize >= cbSrvHeap->getSize())
        {
          textureTableIndex = 0;
          descriptorHeapIndex++;
          descriptorHeapsChanged = true;

          if(cbSrvHeaps.size() < descriptorHeapIndex + 1)
          {
            auto cbSrv = make_shared<DescriptorHeap>(device.Get(), D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV, GPUDescriptorHeapSize, true);
            auto sampler = make_shared<DescriptorHeap>(device.Get(), D3D12_DESCRIPTOR_HEAP_TYPE_SAMPLER, GPUDescriptorHeapSize, true);
            cbSrvHeaps.push_back(cbSrv);
            samplerHeaps.push_back(sampler);
            if(DebugBuild())
            {
              vout << "Forced to grow number of GPU heaps to " << descriptorHeapIndex + 1 << endl;
            }
          }
        }
      }

      void CoreStateMachine::setShouldAdvanceTablesOnDraw()
      {
        shouldAdvanceOnDraw = true;
      }

      UINT CoreStateMachine::getSrvHeapDescriptorIndexForTextureSlot(int slot)
      {
        return textureTableIndex + MaxConstantBuffers;
      }

      void CoreStateMachine::prepareToDraw()
      {
        auto cl = continueRenderingCommands();
        cl->SetGraphicsRootSignature(rootSignature.Get());

        //prepare any buffers required by current shader program
        auto prog = currentProgram;
        if(prog)
        {
          if(globalConstantBuffers.empty())
          {
            globalConstantBuffers.resize(MaxLatencyFrames);
          }
          for(int i = 0; i < MaxLatencyFrames; i++)
          {
            if(!globalConstantBuffers[i])
            {
              const size_t sz = GlobalCBufferMaxSize * GlobalCBufferMaxCalls;
              globalConstantBuffers[i] = make_shared<BufferArray>();
              globalConstantBuffers[i]->ensureStaticSize(sz);
              globalConstantBuffers[i]->provideData(0, sz, nullptr, BufferArray::UT_FORCE_UPLOAD_HEAP);
              ThrowIfFailed(globalConstantBuffers[i]->buffers[0]->Map(0, nullptr, (void **)&globalConstantBufferData[i]));
            }
          }

          auto ba = globalConstantBuffers[currentFrameIndex];
          memcpy(globalConstantBufferData[currentFrameIndex] + GlobalCBufferMaxSize * drawCallIndex, prog->globalCBufferData, prog->globalCBufferSize);
          cl->SetGraphicsRootConstantBufferView(2, ba->buffers[0]->GetGPUVirtualAddress() + GlobalCBufferMaxSize*drawCallIndex);
        }

        auto currentVA = VertexArray::current();
        assert(currentVA != nullptr);
        if(lastUsedVA != currentVA)
        {
          currentVA->prepareForDraw();
          lastUsedVA = currentVA;
        }

        if(descriptorTablesChanged)
        {
          device->CopyDescriptorsSimple(textureTableSize, cbSrvHeaps[descriptorHeapIndex]->hCPU(textureTableIndex), cpuCbSrvHeap->hCPU(0), D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
          device->CopyDescriptorsSimple(textureTableSize, samplerHeaps[descriptorHeapIndex]->hCPU(textureTableIndex), cpuSamplerHeap->hCPU(0), D3D12_DESCRIPTOR_HEAP_TYPE_SAMPLER);
        }

        if(descriptorHeapsChanged)
        {
          ID3D12DescriptorHeap *descHeaps[] = { cbSrvHeaps[descriptorHeapIndex]->get(), samplerHeaps[descriptorHeapIndex]->get() };
          cl->SetDescriptorHeaps(ARRAYSIZE(descHeaps), descHeaps);
          descriptorHeapsChanged = false;
        }

        if(psoDirty)
        {
          //todo: these caches might be stale! I don't know if they rebuild properly when psoDesc doesn't correspond
          if(prog->psoCache)
          {
            psoDesc.CachedPSO = { prog->psoCache->GetBufferPointer(), prog->psoCacheSize };
          }
          if(pipelineState)
          {
            preserveResourceUntilRenderComplete(pipelineState);
            pipelineState = nullptr;
          }
          if(FAILED(device->CreateGraphicsPipelineState(&psoDesc, IID_PPV_ARGS(&pipelineState))))
          {
            prog->psoCache = nullptr;
            psoDesc.CachedPSO = {};
            ThrowIfFailed(device->CreateGraphicsPipelineState(&psoDesc, IID_PPV_ARGS(&pipelineState)));
          }
          else
          {
            ComPtr<ID3DBlob> blob;
            ThrowIfFailed(pipelineState->GetCachedBlob(&blob));
            prog->psoCache = blob;
            prog->psoCacheSize = blob->GetBufferSize();
          }
          psoDirty = false;
          clPsoDirty = true;
        }

        if(clPsoDirty)
        {
          cl->SetPipelineState(pipelineState.Get());
          clPsoDirty = false;
        }

        if(descriptorTablesChanged)
        {
          cl->SetGraphicsRootDescriptorTable(0, cbSrvHeaps[descriptorHeapIndex]->hGPU(textureTableIndex));
          cl->SetGraphicsRootDescriptorTable(1, samplerHeaps[descriptorHeapIndex]->hGPU(textureTableIndex));
          descriptorTablesChanged = false;
        }
      }

      void CoreStateMachine::indexBufferChanged()
      {
        fanEbosSet = 0;
      }

      void CoreStateMachine::bindConstantBuffer(BufferArray::Pointer buffer, int bufferIndex, size_t bufferSize, int registerIndex)
      {
        auto cbvHandle = cbSrvHeap->hCPU(registerIndex);
        D3D12_CONSTANT_BUFFER_VIEW_DESC cbvDesc = {};
        auto &buf = *(buffer);

        assert(registerIndex > 0);
        cbvDesc.BufferLocation = buf[bufferIndex]->GetGPUVirtualAddress();
        cbvDesc.SizeInBytes = bufferSize;
        device->CreateConstantBufferView(&cbvDesc, cbvHandle);
      }

      void CoreStateMachine::createShaderResourceView(int slot, ID3D12Resource *resource, D3D12_SHADER_RESOURCE_VIEW_DESC *rvd, D3D12_SAMPLER_DESC *svd)
      {
        //cpu-readable descriptors
        device->CreateSampler(svd, cpuSamplerHeap->hCPU(slot));
        device->CreateShaderResourceView(resource, rvd, cpuCbSrvHeap->hCPU(MaxConstantBuffers + slot));

        //shader-visible descriptors
        //device->CreateSampler(svd, samplerHeap->hCPU(textureTableIndex+slot));
        //device->CreateShaderResourceView(resource, rvd, cbSrvHeap->hCPU(getSrvHeapDescriptorIndexForTextureSlot(slot)));
      }
    }
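
    To illustrate what I mean by STL containers being a bottleneck - purely for illustration, with hypothetical names rather than my actual VGL code - the thinning mostly amounted to replacing per-draw heap-allocating containers with fixed-size arrays on hot paths:

    //before: heap allocation + destruction on every single draw call
    //std::vector<D3D12_VERTEX_BUFFER_VIEW> views;

    //after: fixed-size array + count; assumes the max slot count is small and known up front
    D3D12_VERTEX_BUFFER_VIEW views[8];
    UINT viewCount = 0;
    //...fill views[0..viewCount) with zero heap traffic, then:
    //cl->IASetVertexBuffers(0, viewCount, views);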
  3. mlfarrell00

    D3D12 atrocious performance

    After doing proper 2-frame latency, my performance shot up to 60.  Going to swap out all my committed resources for placed resources in larger heaps, and that should get me even better performance.  This stuff is complicated, but I'm stoked to be learning how to do all this properly.
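
    For anyone curious, the committed-to-placed swap I have in mind looks roughly like this (just a sketch - the sizes, names, and initial state are illustrative, not my engine's actual code): create one big heap up front, then sub-allocate buffers out of it at 64 KB boundaries instead of paying for a CreateCommittedResource call per buffer.

    //one large heap up front
    D3D12_HEAP_DESC heapDesc = {};
    heapDesc.SizeInBytes = 64ull * 1024 * 1024; //64 MB arena
    heapDesc.Properties = CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_DEFAULT);
    heapDesc.Alignment = D3D12_DEFAULT_RESOURCE_PLACEMENT_ALIGNMENT;
    heapDesc.Flags = D3D12_HEAP_FLAG_ALLOW_ONLY_BUFFERS;
    ComPtr<ID3D12Heap> heap;
    ThrowIfFailed(device->CreateHeap(&heapDesc, IID_PPV_ARGS(&heap)));

    //...then place each buffer at a 64 KB-aligned offset within it
    UINT64 heapOffset = 0;
    auto bufDesc = CD3DX12_RESOURCE_DESC::Buffer(bufferSize);
    ComPtr<ID3D12Resource> buffer;
    ThrowIfFailed(device->CreatePlacedResource(heap.Get(), heapOffset, &bufDesc,
      D3D12_RESOURCE_STATE_COPY_DEST, nullptr, IID_PPV_ARGS(&buffer)));
    heapOffset += (bufferSize + 0xFFFF) & ~0xFFFFull; //bump to the next 64 KB boundary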
  4. mlfarrell00

    D3D12 atrocious performance

      Hello triangle is my nemesis.  I used it as the guide for most of my work.  This example seems to be WAY better, and I'm refactoring based off of it: https://github.com/shobomaru/HelloD3D12/tree/master/ParallelFrameRootConstant

      I knew that forcing the GPU to wait before building the next command list was stupid - I just didn't realize HOW stupid it was.  I suppose it's a miracle that I got 45-50 FPS with 1000 draw calls while it was doing that.
  5. mlfarrell00

    D3D12 atrocious performance

      Thanks a ton for this.  I'm still in hack mode until I get my performance; then I'll probably write a similar class or modify my BufferArray to act this way.  I did the map-memory thing and haven't yet noticed a performance benefit.

    //New section for root cbuffer updates
    if(prog)
    {
      if(globalConstantBuffers.empty())
      {
        globalConstantBuffers.resize(2);
      }
      if(!globalConstantBuffers[0])
      {
        const size_t sz = 4096 * 10000;
        globalConstantBuffers[0] = make_shared<BufferArray>();
        globalConstantBuffers[0]->ensureStaticSize(sz);
        globalConstantBuffers[0]->provideData(0, sz, nullptr, BufferArray::UT_FORCE_UPLOAD_HEAP);
        ThrowIfFailed(globalConstantBuffers[0]->buffers[0]->Map(0, nullptr, (void **)&globalConstantBufferData));
      }

      auto ba = globalConstantBuffers[0];
      //ba->provideData(0, prog->globalCBufferSize, prog->globalCBufferData, BufferArray::UT_FORCE_UPLOAD_HEAP);
      memcpy(globalConstantBufferData + 4096 * drawCallIndex, prog->globalCBufferData, prog->globalCBufferSize);
      cl->SetGraphicsRootConstantBufferView(2, ba->buffers[0]->GetGPUVirtualAddress() + 4096*drawCallIndex);
    }

    My guess is my final issue is ignoring frame latency.  That concept is entirely alien to me, since I've spent most of my career using OpenGL and other APIs that don't really mention that kind of thing.  My question is: what's the exact point?

    My hunch is this, correct me if I'm wrong: you need latency so that you can compute the buffer data for the NEXT frame, so that by the time it's uploaded, the GPU won't stall to access it.  Is that the reason?  Is that what OpenGL does too, but I've never noticed?
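
    In code terms, my hunch amounts to something like this (a minimal sketch, assuming two frames in flight and illustrative names - not my engine's actual members):

    static const UINT64 MaxLatencyFrames = 2;
    UINT64 frameCount = 0;

    void beginFrame(ID3D12Fence *frameFence, HANDLE frameEvent)
    {
      frameCount++;
      //only block once the CPU is a full MaxLatencyFrames ahead of the GPU;
      //until then, the CPU records frame N+1 while the GPU draws frame N
      if(frameCount > MaxLatencyFrames && frameFence->GetCompletedValue() < frameCount - MaxLatencyFrames)
      {
        frameFence->SetEventOnCompletion(frameCount - MaxLatencyFrames, frameEvent);
        WaitForSingleObject(frameEvent, INFINITE);
      }
      //this frame slot's command allocator and upload memory are now safe to reuse
    }

    void endFrame(ID3D12CommandQueue *queue, ID3D12Fence *frameFence)
    {
      queue->Signal(frameFence, frameCount); //GPU bumps the fence when this frame's work completes
    }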
  6. mlfarrell00

    D3D12 atrocious performance

      Trust me, you've got it right on.  I know D3D12 could beat the pants off GL/D3D11 when used properly.  What's frustrating me is that after this much effort, I'm still falling short.  Can you explain to me what you mean by mapping to a monotonically increasing address?

    Here's my new per-draw-call update method.  With this code, I'm now at about 45-50 FPS, compared to the OpenGL backend of the same engine, which gets a solid 60 FPS (or more).  Again, sorry about the tabs.

    The "psoDirty" is only true about twice throughout the entire render.  descriptorHeapsChanged never occurs (the frame fits in just the one heap).  I'm pretty sure the problem still lies in my constant buffer updates.  They were performing terribly when I used a default heap (with a copy from an upload heap), and they perform okay when I use the upload heap only, but they are still quite slow.  When I comment out the heap updates (done via map, memcpy, unmap), FPS shoots back up to a solid 60.

    void CoreStateMachine::prepareToDraw()
    {
      auto cl = continueRenderingCommands();
      cl->SetGraphicsRootSignature(rootSignature.Get());

      //prepare any buffers required by current shader program
      auto prog = currentProgram;
      if(prog)
      {
        if(globalConstantBuffers.empty())
        {
          globalConstantBuffers.resize(10000);
        }
        if(!globalConstantBuffers[drawCallIndex])
        {
          globalConstantBuffers[drawCallIndex] = make_shared<BufferArray>();
          globalConstantBuffers[drawCallIndex]->ensureStaticSize(4096);
        }
        auto ba = globalConstantBuffers[drawCallIndex];
        ba->provideData(0, prog->globalCBufferSize, prog->globalCBufferData, BufferArray::UT_FORCE_UPLOAD_HEAP);
        cl->SetGraphicsRootConstantBufferView(2, ba->buffers[0]->GetGPUVirtualAddress());
      }

      auto currentVA = VertexArray::current();
      assert(currentVA != nullptr);
      if(lastUsedVA != currentVA)
      {
        currentVA->prepareForDraw();
        lastUsedVA = currentVA;
      }

      device->CopyDescriptorsSimple(textureTableSize, cbSrvHeaps[descriptorHeapIndex]->hCPU(textureTableIndex), cpuCbSrvHeap->hCPU(0), D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
      device->CopyDescriptorsSimple(textureTableSize, samplerHeaps[descriptorHeapIndex]->hCPU(textureTableIndex), cpuSamplerHeap->hCPU(0), D3D12_DESCRIPTOR_HEAP_TYPE_SAMPLER);

      if(descriptorHeapsChanged)
      {
        ID3D12DescriptorHeap *descHeaps[] = { cbSrvHeaps[descriptorHeapIndex]->get(), samplerHeaps[descriptorHeapIndex]->get() };
        cl->SetDescriptorHeaps(ARRAYSIZE(descHeaps), descHeaps);
        descriptorHeapsChanged = false;
      }

      //might be smarter to set this up earlier if I can.. not sure what the tradeoff is here
      if(psoDirty)
      {
        if(prog->psoCache)
        {
          psoDesc.CachedPSO = { prog->psoCache->GetBufferPointer(), prog->psoCacheSize };
        }
        if(pipelineState)
        {
          preserveResourceUntilRenderComplete(pipelineState);
          pipelineState = nullptr;
        }
        if(FAILED(device->CreateGraphicsPipelineState(&psoDesc, IID_PPV_ARGS(&pipelineState))))
        {
          prog->psoCache = nullptr;
          psoDesc.CachedPSO = {};
          ThrowIfFailed(device->CreateGraphicsPipelineState(&psoDesc, IID_PPV_ARGS(&pipelineState)));
        }
        else
        {
          ComPtr<ID3DBlob> blob;
          ThrowIfFailed(pipelineState->GetCachedBlob(&blob));
          prog->psoCache = blob;
          prog->psoCacheSize = blob->GetBufferSize();
        }
        psoDirty = false;
      }
      cl->SetPipelineState(pipelineState.Get());

      if(descriptorTablesChanged)
      {
        cl->SetGraphicsRootDescriptorTable(0, cbSrvHeaps[descriptorHeapIndex]->hGPU(textureTableIndex));
        cl->SetGraphicsRootDescriptorTable(1, samplerHeaps[descriptorHeapIndex]->hGPU(textureTableIndex));
        descriptorTablesChanged = false;
      }
    }

    void CoreStateMachine::indexBufferChanged()
    {
      fanEbosSet = 0;
    }

    void CoreStateMachine::bindConstantBuffer(BufferArray::Pointer buffer, int bufferIndex, size_t bufferSize, int registerIndex)
    {
      auto cbvHandle = cbSrvHeap->hCPU(registerIndex);
      D3D12_CONSTANT_BUFFER_VIEW_DESC cbvDesc = {};
      auto &buf = *(buffer);

      assert(registerIndex > 0);
      cbvDesc.BufferLocation = buf[bufferIndex]->GetGPUVirtualAddress();
      cbvDesc.SizeInBytes = bufferSize;
      device->CreateConstantBufferView(&cbvDesc, cbvHandle);
    }

    void CoreStateMachine::createShaderResourceView(int slot, ID3D12Resource *resource, D3D12_SHADER_RESOURCE_VIEW_DESC *rvd, D3D12_SAMPLER_DESC *svd)
    {
      //cpu-readable descriptors
      device->CreateSampler(svd, cpuSamplerHeap->hCPU(slot));
      device->CreateShaderResourceView(resource, rvd, cpuCbSrvHeap->hCPU(MaxConstantBuffers + slot));

      //shader-visible descriptors
      //device->CreateSampler(svd, samplerHeap->hCPU(textureTableIndex+slot));
      //device->CreateShaderResourceView(resource, rvd, cbSrvHeap->hCPU(getSrvHeapDescriptorIndexForTextureSlot(slot)));
    }
    }
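
    If I had to guess at "monotonically increasing address", it's something like this (a sketch with hypothetical names, not code from my engine): map ONE big upload buffer once at startup, then just bump an offset forward per draw call instead of creating and filling a BufferArray per call.

    static const size_t CBSlotSize = 256; //root CBV GPU addresses must be 256-byte aligned
    UINT8 *mappedBase = nullptr;          //persistently mapped upload-heap buffer
    size_t cursor = 0;                    //reset to 0 once per frame, never rewound mid-frame

    D3D12_GPU_VIRTUAL_ADDRESS pushConstants(ID3D12Resource *uploadBuf, const void *src, size_t size)
    {
      size_t slot = (size + CBSlotSize - 1) & ~(CBSlotSize - 1); //round the write up to 256 bytes
      memcpy(mappedBase + cursor, src, size);
      D3D12_GPU_VIRTUAL_ADDRESS gpuVA = uploadBuf->GetGPUVirtualAddress() + cursor;
      cursor += slot; //next draw writes strictly past this one
      return gpuVA;
    }

    //per draw: cl->SetGraphicsRootConstantBufferView(2, pushConstants(buf, data, sz));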
  7. mlfarrell00

    D3D12 atrocious performance

    It may be time to give up on D3D12.  My OpenGL system massively outperforms it with a drastically simpler architecture, even on a freaking iPad.  I thought I'd be able to get performance gains without dicking around too much - that was the appeal of D3D12 and the coming Vulkan to me.  But man... 7 days in a row of staying up til 2am... and I still don't have it.  This just plainly is not worth it for me.

    My system changes "uniform" global constants very often and at unpredictable times.  At a high level, the kind of precomputation that D3D12 would need to be fast just isn't there.  It feels like instanced rendering all over again.  I can't seem to find a way to efficiently copy the needed cbuffer data before draw calls in a way that doesn't hurt performance.

    I know my code is suboptimal, but I expected to outperform OpenGL at the very least with this kind of baseline.

    Very disappointing...
  8. mlfarrell00

    D3D12 atrocious performance

    omfg.  Here's another tip: DON'T test performance with debug builds.  I forgot how bad the performance of MSVC-generated debug builds is.  Doing a release build shot the performance of the whole system up to max FPS.

    Spent the last hour chasing around a phantom.
  9. mlfarrell00

    D3D12 atrocious performance

    Thanks.  You're so right.  I just came to this.  My arch isn't set up well for caching PSOs, but after refactoring some stuff, fingers crossed, I'll get some good draw-call performance.
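
    The idea, as I understand it (a minimal sketch with hypothetical names, not my engine's actual code): hash the state that actually varies between draws, and only hit the driver on a cache miss.

    std::unordered_map<uint64_t, ComPtr<ID3D12PipelineState>> psoCache;

    ID3D12PipelineState *getOrCreatePSO(ID3D12Device *device, const D3D12_GRAPHICS_PIPELINE_STATE_DESC &desc, uint64_t stateHash)
    {
      //stateHash covers shaders, blend/cull/depth state, input layout, RT formats...
      auto it = psoCache.find(stateHash);
      if(it != psoCache.end())
        return it->second.Get(); //hot path: no driver-side shader compile at draw time

      ComPtr<ID3D12PipelineState> pso;
      ThrowIfFailed(device->CreateGraphicsPipelineState(&desc, IID_PPV_ARGS(&pso)));
      psoCache[stateHash] = pso;
      return pso.Get();
    }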
  10. I'm getting about 10 FPS with only about 20 draw calls of very small meshes.  Clearly I'm doing something very, very wrong.  Is there an issue with populating a single command list to draw 20 items?  I figured even if that isn't optimal, it shouldn't be THIS terrible.  At this point, my iPad running OpenGL ES is outperforming my D3D12 engine by a massive margin.  The following code runs before most of my draw calls to set up necessary state (sorry for the messed-up tab spacing):

    void CoreStateMachine::prepareToDraw()
    {
      auto cl = continueRenderingCommands();
      cl->SetGraphicsRootSignature(rootSignature.Get());

      //prepare any buffers required by current shader program
      auto prog = currentProgram;
      if(prog)
      {
        if(!prog->globalCBuffer)
        {
          prog->globalCBuffer = make_shared<BufferArray>();
          prog->globalCBufferDirty = true;
        }
        else
        {
          //this should be cleaned up at some point
          preserveResourceUntilRenderComplete(prog->globalCBuffer->uploadBuffers[0]);
          preserveResourceUntilRenderComplete(prog->globalCBuffer->buffers[0]);
          prog->globalCBuffer = make_shared<BufferArray>();
          prog->globalCBufferDirty = true;
        }
        if(prog->globalCBufferDirty)
        {
          prog->globalCBuffer->provideData(0, prog->globalCBufferSize, prog->globalCBufferData, BufferArray::UT_DYNAMIC);
          prog->globalCBufferDirty = false;
        }
        cl->SetGraphicsRootConstantBufferView(2, prog->globalCBuffer->buffers[0]->GetGPUVirtualAddress());
      }

      auto currentVA = VertexArray::current();
      assert(currentVA != nullptr);
      currentVA->prepareForDraw();

      device->CopyDescriptorsSimple(textureTableSize, cbSrvHeaps[descriptorHeapIndex]->hCPU(textureTableIndex), cpuCbSrvHeap->hCPU(0), D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
      device->CopyDescriptorsSimple(textureTableSize, samplerHeaps[descriptorHeapIndex]->hCPU(textureTableIndex), cpuSamplerHeap->hCPU(0), D3D12_DESCRIPTOR_HEAP_TYPE_SAMPLER);

      if(descriptorHeapsChanged)
      {
        ID3D12DescriptorHeap *descHeaps[] = { cbSrvHeaps[descriptorHeapIndex]->get(), samplerHeaps[descriptorHeapIndex]->get() };
        cl->SetDescriptorHeaps(ARRAYSIZE(descHeaps), descHeaps);
        descriptorHeapsChanged = false;
      }

      //might be smarter to set this up earlier if I can.. not sure what the tradeoff is here
      if(pipelineState)
      {
        preserveResourceUntilRenderComplete(pipelineState);
        pipelineState = nullptr;
      }
      ThrowIfFailed(device->CreateGraphicsPipelineState(&psoDesc, IID_PPV_ARGS(&pipelineState))); //rebuilds the PSO before every draw
      cl->SetPipelineState(pipelineState.Get());

      if(descriptorTablesChanged)
      {
        cl->SetGraphicsRootDescriptorTable(0, cbSrvHeaps[descriptorHeapIndex]->hGPU(textureTableIndex));
        cl->SetGraphicsRootDescriptorTable(1, samplerHeaps[descriptorHeapIndex]->hGPU(textureTableIndex));
        descriptorTablesChanged = false;
      }
    }
  11. Edit: I never learn not to post when I'm pissed off.  The answer was that the viewport struct was missing two more very important members, which defaulted to 0 and 0 (the depth min and max for the viewport):

    D3D12_VIEWPORT vp = { x, y, w, h, 0, 1 }; //don't forget the 0 and 1
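
    Spelled out field by field, for anyone else who hits this (cl here is the open graphics command list):

    D3D12_VIEWPORT vp = {};
    vp.TopLeftX = (FLOAT)x;
    vp.TopLeftY = (FLOAT)y;
    vp.Width    = (FLOAT)w;
    vp.Height   = (FLOAT)h;
    vp.MinDepth = 0.0f; //zero-initialized if you forget it...
    vp.MaxDepth = 1.0f; //...which collapses the depth range to nothing
    cl->RSSetViewports(1, &vp);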
  12. Posting this to help others who land here with a similar problem.  I was right in suspecting one global cbuffer.  The following code allows me to reliably determine the cbuffer index of the global buffer:

    if(shadersBuilt[0])
    {
      ID3D12ShaderReflection *reflector = nullptr;
      ThrowIfFailed(D3DReflect(prog->vertexShader.blob->GetBufferPointer(), prog->vertexShader.blob->GetBufferSize(),
                               IID_ID3D12ShaderReflection, (void **)&reflector));

      D3D12_SHADER_DESC descShader;
      ThrowIfFailed(reflector->GetDesc(&descShader));

      for(int i = 0; i < descShader.ConstantBuffers; i++)
      {
        auto global = reflector->GetConstantBufferByIndex(i);
        D3D12_SHADER_BUFFER_DESC desc;
        ThrowIfFailed(global->GetDesc(&desc));

        if(string(desc.Name).find("$Global") != string::npos)
        {
          vout << desc.Name << endl;

          auto var = global->GetVariableByIndex(0);
          D3D12_SHADER_VARIABLE_DESC descVar;
          ThrowIfFailed(var->GetDesc(&descVar));
          vout << descVar.Name << endl;
        }
      }
    }
  13. I'm using ANGLE to automatically translate my GLSL shader code for use in a D3D12 engine.  I already know how to update constant buffers from the new API.  Where I'm stuck is how to get at variables that look like this:

    uniform float4 someVar : register(c0);

    from the D3D12 API.  Does anybody know how I can do this?  How do I populate "c" variables from DirectX 12?

    Edit: am I right to believe these go into the "$Globals" cbuffer, following the same packing rules?
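
    To illustrate what I'm assuming (my unverified guess at the packing, not something I've confirmed yet):

    //HLSL side, as emitted by ANGLE:
    //  uniform float4 someVar : register(c0);
    //  uniform float2 pair    : register(c1);
    //my assumption: both land in the implicit "$Globals" cbuffer, one 16-byte register per "c" slot
    struct GlobalsCB //CPU-side mirror I'd upload to whatever slot reflection reports for $Globals
    {
      float someVar[4]; //c0: bytes 0..15
      float pair[2];    //c1: bytes 16..23
      float _pad[2];    //c1 padded out to a full 16-byte register
    };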