maxest

DX11 ID3D11Query reporting weird results


I implemented DX queries after this blog post:
https://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/

Queries work perfectly fine... as long as I don't use VSync or any other form of Sleep. Why would that happen? I issue queries right before my Compute/Dispatch code, issue them again right after, and then read the results (spinning on GetData while it returns S_FALSE).
Without VSync my code takes a consistent 0.39-0.4 ms. After turning VSync on it starts at something like 0.46 ms, bumps up to 0.61 ms after a second, and a few seconds later I get something like 1.2 ms.

I also used this source:
http://reedbeta.com/blog/gpu-profiling-101/
The difference here is that the author uses the disjoint query for the whole Render() function instead of one per individual measurement. When I implemented it this way the timings were inconsistent (like the 0.46, 0.61, 1.2 above) regardless of VSync.
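
For reference, the pattern I'm using looks roughly like this (a trimmed-down sketch, not the exact code from either blog post; error handling omitted, and device/context/groupsX/groupsY stand in for my real objects and dispatch dimensions):

D3D11_QUERY_DESC queryDesc = {};
queryDesc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
ID3D11Query *disjointQuery = nullptr;
device->CreateQuery(&queryDesc, &disjointQuery);
queryDesc.Query = D3D11_QUERY_TIMESTAMP;
ID3D11Query *startQuery = nullptr, *endQuery = nullptr;
device->CreateQuery(&queryDesc, &startQuery);
device->CreateQuery(&queryDesc, &endQuery);

// bracket the work being measured
context->Begin(disjointQuery);
context->End(startQuery); // timestamp queries only use End()
context->Dispatch(groupsX, groupsY, 1); // the compute work being profiled
context->End(endQuery);
context->End(disjointQuery);

// read back, spinning while GetData returns S_FALSE
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjointData;
while (context->GetData(disjointQuery, &disjointData, sizeof(disjointData), 0) == S_FALSE) {}
UINT64 timeStart = 0, timeEnd = 0;
while (context->GetData(startQuery, &timeStart, sizeof(timeStart), 0) == S_FALSE) {}
while (context->GetData(endQuery, &timeEnd, sizeof(timeEnd), 0) == S_FALSE) {}

if (!disjointData.Disjoint)
{
    double gpuMs = double(timeEnd - timeStart) / double(disjointData.Frequency) * 1000.0;
    // gpuMs is the reported GPU time for the Dispatch
}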


This behaviour sounds exactly like what I'd expect if the GPU was throttling back its frequency because you aren't giving it enough work to do to warrant being clocked at peak frequency.

By turning off VSync you're giving the GPU as much work to do as it can manage. With VSync enabled you're restricting it to 60 frames worth of work per second which it can easily deliver at reduced clock speeds.

12 hours ago, Hodgman said:

Are you spinning on the query results immediately, or do you wait a frame before trying to get the results? 

I tested both. No difference.

10 hours ago, ajmiles said:

This behaviour sounds exactly like what I'd expect if the GPU was throttling back its frequency because you aren't giving it enough work to do to warrant being clocked at peak frequency.

By turning off VSync you're giving the GPU as much work to do as it can manage. With VSync enabled you're restricting it to 60 frames worth of work per second which it can easily deliver at reduced clock speeds.

I thought about something along those lines but quickly came to the conclusion that it shouldn't be happening. I assumed everything would take the same time as in the no-VSync case because the waiting happens in Present; why would any extra work land in my actual computation time?
I just checked how much time Present takes with VSync and it's indeed around 15 ms, with some variance of course. So it's still a mystery to me why the computation code I profile would take more time in VSync mode. I wonder if that would also be the case under D3D12.

EDIT: Encompassing the whole Render function with one disjoint query ( http://reedbeta.com/blog/gpu-profiling-101/ ) actually works when VSync is off. I made a wrong observation. It behaves exactly the same as Begin/End of the disjoint query right before and after the block we're profiling.

Edited by maxest

39 minutes ago, maxest said:

I tested both. No difference.

I thought about something along those lines but quickly came to the conclusion that it shouldn't be happening. I assumed everything would take the same time as in the no-VSync case because the waiting happens in Present; why would any extra work land in my actual computation time?
I just checked how much time Present takes with VSync and it's indeed around 15 ms, with some variance of course. So it's still a mystery to me why the computation code I profile would take more time in VSync mode. I wonder if that would also be the case under D3D12.

EDIT: Encompassing the whole Render function with one disjoint query ( http://reedbeta.com/blog/gpu-profiling-101/ ) actually works when VSync is off. I made a wrong observation. It behaves exactly the same as Begin/End of the disjoint query right before and after the block we're profiling.

Even if you time only the work you're interested in (and not the whole frame), it's still going to take a variable amount of time depending on how high the GPU's clock speed happens to be at that point in time.

If the GPU can see it's only doing 2ms of work every 16ms, then it may underclock itself by a factor of 3-4x such that the 2ms of work ends up taking 6ms-8ms instead.

What's happening is something like this:

1) At 1500MHz, your work takes 0.4ms and ~16.2ms is spent idle at the end of the frame.
2) The GPU realises it could run a bit slower and still be done in plenty of time so it underclocks itself just a little bit to save power.
3) At 1200MHz, your work takes 0.5ms and ~16.1ms is spent idle at the end of the frame.
4) Still plenty of time spare, so it underclocks itself even further.
5) At 900MHz, your work takes 0.6ms and ~16.0ms is spent idle at the end of the frame.
6) *Still* plenty of time spare, so it dramatically underclocks itself.
7) At 500MHz, your work takes 3x longer than it did originally, now costing 1.2ms. There's still 15.4ms of idle time at the end of the frame, so this is still OK.
8) At this point the GPU may not have any lower power states to clock down to, so the work never takes any more than 1.2ms.
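
To put a rough rule of thumb on it: the cost of a fixed amount of work scales roughly inversely with the clock, $t_{\text{new}} \approx t_{\text{old}} \cdot f_{\text{old}} / f_{\text{new}}$, so $0.4\,\text{ms} \times 1500/500 = 1.2\,\text{ms}$.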

In D3D12 we (Microsoft) added an API called ID3D12Device::SetStablePowerState, in part to address this problem.

This API fixes the GPU's clock speed to something it can always run at without having to throttle back due to thermal or power limitations. So if your GPU has a "Base Clock" of 1500MHz but can periodically "Boost" to 1650MHz, we'll fix the clock speed to 1500MHz. Note that this API does not work on end users' machines as it requires Debug bits to be installed, so it can't be used in retail titles. Note also that performance will likely be worse than on an end-user's machine because we've artificially limited the clock speed below the peak to ensure a stable and consistent clock speed. With this in place, profiling becomes easier because the clock speed is known to be stable across runs and won't clock up and down as in your situation.

Since I don't think SetStablePowerState was ever added to D3D11, it should be simple enough to create a dummy D3D12 application, create a device, call SetStablePowerState and then put the application into an infinite Sleep in the background. I've never tried this, but that should be sufficient to keep the GPU's frequency fixed to some value for the lifetime that this dummy D3D12 application/device is created and running.
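
Something along these lines (an untested sketch, error handling omitted):

#include <windows.h>
#include <d3d12.h>
#pragma comment(lib, "d3d12.lib")

int main()
{
    // create a device on the default adapter
    ID3D12Device *device = nullptr;
    if (FAILED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device))))
        return 1;

    // fix the GPU clocks for as long as this process/device stays alive
    device->SetStablePowerState(TRUE);

    // park the process in the background; profile the D3D11 app in the meantime
    Sleep(INFINITE);
}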

2 hours ago, ajmiles said:

Since I don't think SetStablePowerState was ever added to D3D11, it should be simple enough to create a dummy D3D12 application, create a device, call SetStablePowerState and then put the application into an infinite Sleep in the background. I've never tried this, but that should be sufficient to keep the GPU's frequency fixed to some value for the lifetime that this dummy D3D12 application/device is created and running.

That's a great idea in theory, except that we've deprecated this API in recent Windows 10 releases (I don't recall exactly when), so you'll need to be on a slightly older build. What we found is that given your example of a base of 1500 and a boost of 1650, the GPU is able to maintain that boosted clock rate nearly indefinitely. So using SetStablePowerState produces a completely artificial scenario that doesn't mimic what would happen on real world machines, making it relatively useless for profiling.

Edited by SoldierOfLight


@ajmiles: Thank you so, so much for this detailed explanation. I hadn't thought about the GPU changing its clock speed. This makes more sense than some redundant work being performed :).

I have checked what you proposed. I took a simple DX12 sample, called SetStablePowerState with true (I needed to turn on Developer Mode in my Windows 10; I wasn't aware of its existence) and called a permanent Sleep. Then I ran my application. Now, regardless of whether I use VSync or not, or call Sleep in my app or not, I get a consistent 0.46 ms. It's more than the 0.4 ms I got without VSync and without SetStablePowerState, but at least it's stable. So as I understand it, the GPU is working at a lower clock speed than it could (without Boost), but this speed is fixed.

I have one more case whose results I don't entirely understand. I have code of this form:

-- Begin CPU Profiler (with QueryPerformanceCounter etc.)
-- Begin GPU Profile
CopyResource (download from GPU to CPU)
Map
-- End GPU Profiler
do something with mapped data
Unmap
-- End CPU Profiler
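
For completeness, fleshed out that looks more or less like this (a sketch; the staging buffer, source buffer and query objects are created elsewhere and reuse the names from the snippet earlier in the thread):

LARGE_INTEGER cpuFreq, cpuStart, cpuEnd;
QueryPerformanceFrequency(&cpuFreq);
QueryPerformanceCounter(&cpuStart);                 // Begin CPU profiler

context->Begin(disjointQuery);                      // Begin GPU profiler
context->End(startQuery);

context->CopyResource(stagingBuffer, gpuBuffer);    // download from GPU to CPU

D3D11_MAPPED_SUBRESOURCE mapped;
context->Map(stagingBuffer, 0, D3D11_MAP_READ, 0, &mapped);

context->End(endQuery);                             // End GPU profiler
context->End(disjointQuery);

// do something with mapped.pData
context->Unmap(stagingBuffer, 0);

QueryPerformanceCounter(&cpuEnd);                   // End CPU profiler
double cpuMs = double(cpuEnd.QuadPart - cpuStart.QuadPart) / double(cpuFreq.QuadPart) * 1000.0;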


The GPU profiler reports 5 ms whereas the CPU one reports 2-3 ms. If anything, shouldn't the CPU timer report a bigger time than the GPU one? I download around 1 MB of data. When I measure only CopyResource and Map with the CPU timer I get around 1 ms.

I would just like to ask one more, related thing. In my search for reliable counters I stumbled upon this (https://msdn.microsoft.com/en-us/library/windows/desktop/ff476364(v=vs.85).aspx) but could find no simple example of its usage. Does it work at all?

Edited by maxest

39 minutes ago, SoldierOfLight said:

That's a great idea in theory, except that we've deprecated this API in recent Windows 10 releases (I don't recall exactly when), so you'll need to be on a slightly older build. What we found is that given your example of a base of 1500 and a boost of 1650, the GPU is able to maintain that boosted clock rate nearly indefinitely. So using SetStablePowerState produces a completely artificial scenario that doesn't mimic what would happen on real world machines, making it relatively useless for profiling.

Interesting, it might be that we haven't pushed anything out yet with that change in it. It still exists in the Creators Update SDK, and on whatever release of Windows 10 'maxest' is running it still seems to work.

I'll follow up with you offline about why we decided the API wasn't useful. It feels like it still has value in scenarios where you want a consistent time from run to run and want to analyse whether an algorithmic change improves performance or not. Even if it doesn't give you real numbers for any user in the real world, consistency across runs still seems useful during development / optimisation.

38 minutes ago, maxest said:

I have one more case whose results I don't entirely understand. I have code of this form:


-- Begin CPU Profiler (with QueryPerformanceCounter etc.)
-- Begin GPU Profile
CopyResource (download from GPU to CPU)
Map
-- End GPU Profiler
do something with mapped data
Unmap
-- End CPU Profiler

The GPU profiler reports 5 ms whereas the CPU one reports 2-3 ms. If anything, shouldn't the CPU timer report a bigger time than the GPU one? I download around 1 MB of data. When I measure only CopyResource and Map with the CPU timer I get around 1 ms.

I don't have a definitive answer to why this might be, but I do have one theory.

You can think of (almost) every API call you make as a packet of data that gets fed to the GPU to execute at a later date. Behind the scenes these packets of data (Draw, Dispatch, Copy, etc) are broken up into segments and sent to the GPU as a batch rather than one by one. The Begin/End Query packets are no different. It may be that the timestamp query you've inserted after the "Map" is the first command after a batch of commands has been sent to the GPU, and therefore it isn't issued to the GPU immediately after the CopyResource/Map events have executed. Therefore, my theory is that you're actually timing a lot of idle time between the CopyResource and the next chunk of GPU work that causes the buffer to be flushed and the GPU to start executing useful work again.

You don't have any control over when D3D11 breaks a segment and flushes the commands to the GPU (you can force a flush using ID3D11DeviceContext::Flush, but you can't prevent one). I wouldn't expect 'Map' to do anything on the GPU, but moving the timestamp query before the map may be sufficient to get the timestamp query executed in the segment before the break. Try that perhaps?
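
In other words, something like this (a sketch, reusing the variable names from earlier in the thread):

context->CopyResource(stagingBuffer, gpuBuffer);
context->End(endQuery);        // timestamp now lands in the same segment as the copy
context->End(disjointQuery);
// context->Flush();           // optionally force this segment to be submitted right away
context->Map(stagingBuffer, 0, D3D11_MAP_READ, 0, &mapped);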

I've never seen D3D11_COUNTER used before, but Jesse (SoldierOfLight) may know whether it ever saw any use.

Edited by ajmiles


As far as counters go, in D3D11 they're all IHV-specific. In D3D10 there were API-defined counters, but they were deprecated in D3D11.

The current model for performance counters is the plugin model exposed by PIX.

Also, I just checked, and apparently I was wrong about SetStablePowerState: we did keep it around, we just moved it from requiring the D3D12 debug layers to requiring Developer Mode. My bad.

Edited by SoldierOfLight

1 hour ago, ajmiles said:

I don't have a definitive answer to why this might be, but I do have one theory.

You can think of (almost) every API call you make as a packet of data that gets fed to the GPU to execute at a later date. Behind the scenes these packets of data (Draw, Dispatch, Copy, etc) are broken up into segments and sent to the GPU as a batch rather than one by one. The Begin/End Query packets are no different. It may be that the timestamp query you've inserted after the "Map" is the first command after a batch of commands has been sent to the GPU, and therefore it isn't issued to the GPU immediately after the CopyResource/Map events have executed. Therefore, my theory is that you're actually timing a lot of idle time between the CopyResource and the next chunk of GPU work that causes the buffer to be flushed and the GPU to start executing useful work again.

You don't have any control over when D3D11 breaks a segment and flushes the commands to the GPU (you can force a flush using ID3D11DeviceContext::Flush, but you can't prevent one). I wouldn't expect 'Map' to do anything on the GPU, but moving the timestamp query before the map may be sufficient to get the timestamp query executed in the segment before the break. Try that perhaps?

I actually did try placing the End query right after CopyResource and before Map, and that reported (as far as I remember, I can't check now) something around 0.1 ms. Now I'm not really sure how I should measure the time it takes to download data from GPU to CPU. My CPU timer, when used to enclose CopyResource and Map, reported that downloading 11.5 GB took 1 second, which agrees with a CUDA-based test application for measuring PCI-E throughput that I used. When lowered down to 8 MB the download took 1.5 ms, and when lowered to 1 MB it took 1 ms. I'm not sure if PCI-E downloads should scale linearly as a function of data size, but my tests show that they don't; at least that's what my CPU timer says. The 0.1 ms reported by the GPU timer when measuring CopyResource would indicate linear scaling, though. So now I'm not sure whether I should trust the CPU timer reporting 1 ms (CopyResource + Map) or the GPU timer reporting 0.1 ms (just CopyResource).


0.1ms sounds about right for copying 1MB over a bus that's roughly 16GB/s, so I'd be inclined to believe that number. It should scale approximately linearly.
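
Back-of-the-envelope: $1\,\text{MB} \div 16\,\text{GB/s} \approx 0.06\,\text{ms}$, and a little overhead on top of that puts you in the region of 0.1 ms.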

You have to bear in mind that the CPU timer isn't just timing how long it takes the CPU to do useful work, but how long it takes the GPU to catch up and do all its outstanding work. By calling Map you've required the GPU to catch up and execute all the work in its queue, do the copy and signal to the CPU that it's done. The more work the GPU has to run prior to the call to "CopyResource", the longer the CPU has to sit there and wait for it to complete. For that reason, I wouldn't expect the CPU timer to ever record a very low value in the region of 0.1ms no matter how small the copy is.


I thought it should have been 0.1 ms because, after refactoring the whole "system" I'm working on so that I only need to download 1 MB instead of 8 MB, the total processing time went down by around 1.5 ms.

Thank you again so much ajmiles.

