
DX12 short cmdlist vs. long cmdlist


Recommended Posts

Hey Guys,

 

Among the recommendations for building command lists in DX12, I remembered one which said that a good strategy is to have 12~20 draw calls per commandlist. Without questioning the reason behind it, I blindly targeted filling around 12 draw/dispatch calls before submitting my cmdlist to the GPU in my 'single-thread' renderer.... Today, after I fixed a GPU timestamp bug, I realized that my GPU always idles for half of the frame time.... The GPU is waiting for the CPU to 'accumulate enough draw calls' before handing the cmdlist over to the GPU...

 

Now I have totally changed my strategy: as long as I know a draw/dispatch will take a reasonable amount of GPU time, I submit it immediately, even if there is only 1 draw/dispatch in that cmdlist, so that while the GPU is working on the job, the CPU is building new jobs for it... (please let me know if that strategy is also not recommended...)

 

But why is it recommended to have 12~20 draws/dispatches per commandlist? What's the difference between a short and a long cmdlist in terms of CPU/GPU overhead?

 

Thanks in advance 


If you've noticed a difference in GPU idling between those two cases, I would guess that it's because you're syncing every frame instead of following the recommended practice of having 1 frame of latency between the CPU and GPU. This is mostly down to your presentation code.

You should have quite a few draws per submission because there's a large CPU cost involved in submission. One of D3D12's advantages is that draws are cheap, but doing one submit per draw-call will negate this benefit.

In my tests I found that ~500 draws per command list performed best in my game.
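
For reference (not code from the thread), here is a minimal sketch of what 1 frame of latency with a per-slot fence could look like; the frame count, the globals, and the RecordAndExecuteCommandLists helper are all assumptions:

// Sketch only: the CPU records frame N while the GPU executes frame N-1.
// Device/swapchain/command list creation and error handling are omitted.
#include <Windows.h>
#include <d3d12.h>
#include <dxgi1_4.h>

static const UINT   kFrameCount  = 2;           // lets the CPU run 1 frame ahead
ID3D12Device*       device       = nullptr;     // assumed created elsewhere
ID3D12CommandQueue* commandQueue = nullptr;     // assumed created elsewhere
IDXGISwapChain3*    swapChain    = nullptr;     // assumed created elsewhere
ID3D12Fence*        frameFence   = nullptr;
HANDLE              fenceEvent   = nullptr;
UINT64              fenceValues[kFrameCount] = {};
UINT64              nextFenceValue = 1;
UINT                frameIndex     = 0;

void RecordAndExecuteCommandLists(UINT slot);   // hypothetical helper: record + ExecuteCommandLists

void InitFrameFence()
{
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&frameFence));
    fenceEvent = CreateEvent(nullptr, FALSE, FALSE, nullptr);
}

void RenderFrame()
{
    // Block only if the GPU hasn't finished the frame that last used this slot;
    // with 2 slots the CPU is normally a full frame ahead and never waits here.
    if (frameFence->GetCompletedValue() < fenceValues[frameIndex])
    {
        frameFence->SetEventOnCompletion(fenceValues[frameIndex], fenceEvent);
        WaitForSingleObject(fenceEvent, INFINITE);
    }

    RecordAndExecuteCommandLists(frameIndex);
    swapChain->Present(1, 0);

    // Signal after submission and remember the value this slot must reach
    // before its per-frame resources can be reused.
    commandQueue->Signal(frameFence, nextFenceValue);
    fenceValues[frameIndex] = nextFenceValue++;
    frameIndex = (frameIndex + 1) % kFrameCount;
}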

Thanks Hodgman, could you elaborate on how to fit the 1 frame of latency into my setup, or point me to some resources about it? I have 5 frame buffers, but I guess I got confused and lost somewhere. My engine is based on Microsoft's MiniEngine (link); though I modified lots of things, the main framework is almost the same: basically I record a commandlist and submit it before Present within the same frame....

My current project is more like an academic research project, so I don't have tons of stuff to draw; typically I only have around 50 draw/dispatch calls per frame. I mean, currently I can have the CPU wait for the GPU, but definitely not the other way around. So what's your suggestion?

 

Also, since my project has the following logic per frame, I feel it's very tricky to adopt the '1 frame of latency' strategy:

do {
    m = CPU_ICPSolver( result ); // Nothing to do with the GPU inside

    GPU_PrepareWorkingBuffer(
        depth_and_normalmap1, // input as SRV
        depth_and_normalmap2, // input as SRV
        matrix,               // input as CBV
        workingBuf);          // output as UAV (all 7 buffers)

    for (int i = 0; i < 7; ++i) {
        GPU_Reduction::Process(workingBuf[i]); // reduction to 1 float4 value inside the GPU, but not copied to the readback buffer
    }
    GPU_Reduction::Readback( result ); // read the reduction result: copy from default heap to readback heap, needs to wait for the GPU inside

    reprojection_error = GetReprojectionError( result );
    ++iterations;
} while (iterations < 20 && reprojection_error > threshold);

So my project has a long GPU/CPU work dependency chain here. Any suggestions? Thanks.



I remembered one which said that a good strategy is to have 12~20 draw calls per commandlist.

Actually, IIRC the suggestion is 12-20 draw calls per command list minimum. The document Practical_DX12_Programming_Model_and_Hardware_Capabilities.pdf (on page 7) actually suggests aiming for 15 to 30 command lists per frame, split across 5 to 10 ExecuteCommandLists invocations. I think you can do more command lists than the recommendation, but I remember reading, or someone saying, not to exceed 10 ExecuteCommandLists calls per frame.
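
For illustration (not from the post), a minimal sketch of what that guideline amounts to: batch several recorded command lists into a single ExecuteCommandLists call; commandQueue and the recorded lists are assumed to exist already.

// Sketch only: pay the per-submission overhead once for a whole batch of
// command lists instead of once per list.
#include <d3d12.h>
#include <vector>

void SubmitBatch(ID3D12CommandQueue* commandQueue,
                 const std::vector<ID3D12GraphicsCommandList*>& recordedLists)
{
    // ExecuteCommandLists takes the base ID3D12CommandList interface.
    std::vector<ID3D12CommandList*> lists(recordedLists.begin(), recordedLists.end());

    // One call submits the whole batch; each list was already Close()'d.
    commandQueue->ExecuteCommandLists(static_cast<UINT>(lists.size()), lists.data());
}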


I assume you have compute tasks with GPU<->CPU dependencies, but maybe while waiting you can do some draw calls based on the compute results from the previous frame.

This way you might be able to fill the bubbles and utilize Hodgman's suggestion, but probably at the cost of double buffering some memory.

The problem you have is that readback - you are going to stall the GPU while you wait for the tasks pre-readback to complete, then do the readback, and then loop again.
Every time you wait on the result you will cause the CPU and GPU to sync - this is bad voodoo.

If you are only doing that loop in your application, well, you'll have to suck it up.

Games, however, will typically run something like that over a few frames, or find some other method of keeping it all on the GPU for the whole loop in order to remove that readback stall, or at least minimise its impact: do all the work, issue a readback via the copy queue, and only stall for the result at the last moment if it isn't ready, doing as much work as possible on the CPU/GPU in the meantime to cover the copy time. As you have a CPU-GPU data dependency, this could be tricky without reworking the algorithm somewhat.
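
For illustration (not phantom's code), a minimal sketch of the "issue the readback via the copy queue, stall only at the last moment" idea with a fence; all the names are assumptions, and resource state transitions are omitted:

// Sketch only: kick off a copy to a readback heap on a copy queue, signal a
// fence, do other CPU work, and only block if the result still isn't ready.
// device, copyQueue, copyList, defaultBuf and readbackBuf are assumed to exist,
// and to already be in the required COPY_SOURCE/COPY_DEST states.
#include <Windows.h>
#include <d3d12.h>

UINT64       readbackFenceValue = 0;
ID3D12Fence* readbackFence = nullptr;   // created once via device->CreateFence

void KickReadback(ID3D12CommandQueue* copyQueue,
                  ID3D12GraphicsCommandList* copyList,
                  ID3D12Resource* defaultBuf,
                  ID3D12Resource* readbackBuf)
{
    copyList->CopyResource(readbackBuf, defaultBuf);  // default heap -> readback heap
    copyList->Close();
    ID3D12CommandList* lists[] = { copyList };
    copyQueue->ExecuteCommandLists(1, lists);
    copyQueue->Signal(readbackFence, ++readbackFenceValue);
}

void WaitForReadbackIfNeeded(HANDLE fenceEvent)
{
    // ...do unrelated CPU work here to cover the copy time...
    if (readbackFence->GetCompletedValue() < readbackFenceValue)
    {
        readbackFence->SetEventOnCompletion(readbackFenceValue, fenceEvent);
        WaitForSingleObject(fenceEvent, INFINITE);    // last-moment stall only
    }
    // readbackBuf can now be Map()'d and read on the CPU.
}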


 

I remembered one which said that a good strategy is to have 12~20 draw calls per commandlist.

Actually, IIRC the suggestion is 12-20 draw calls per command list minimum. The document Practical_DX12_Programming_Model_and_Hardware_Capabilities.pdf (on page 7) actually suggests aiming for 15 to 30 command lists per frame, split across 5 to 10 ExecuteCommandLists invocations. I think you can do more command lists than the recommendation, but I remember reading, or someone saying, not to exceed 10 ExecuteCommandLists calls per frame.

 

Thanks infinisearch, what I'm curious about is how this overhead affects the execution timeline: is this overhead entirely on the CPU side, so it doesn't stall the GPU, and the GPU can still work on previous tasks while the CPU is finishing up sending the cmdlist? If that is the case, I do have tons of spare CPU cycles to spend on such overhead in order to saturate the GPU. But if there are any GPU-CPU sync points in this overhead, I guess I'll probably have to rethink my algorithm.... :(

Thanks


I assume you have compute tasks with GPU<->CPU dependencies, but maybe while waiting you can do some draw calls based on the compute results from the previous frame. This way you might be able to fill the bubbles and utilize Hodgman's suggestion, but probably at the cost of double buffering some memory.

If you are only doing that loop in your application, well, you'll have to suck it up. Games, however, will typically run something like that over a few frames, or find some other method of keeping it all on the GPU for the whole loop in order to remove that readback stall, or at least minimise its impact: do all the work, issue a readback via the copy queue, and only stall for the result at the last moment if it isn't ready, doing as much work as possible on the CPU/GPU in the meantime to cover the copy time. As you have a CPU-GPU data dependency, this could be tricky without reworking the algorithm somewhat.
 

 

Thanks JoeJ and phantom. Sadly that GPU<->CPU dependency chain is very sensitive to latency, so I can't spread it over multiple frames to hide the GPU stall. The best bet would be replacing CPU_ICPSolver with GPU_ICPSolver.... The main task of that function is solving a 6x6 linear system, so if you know of any existing GPU solver, that would be my silver bullet! And given that the matrix size is a fixed 6x6, I think it should totally be doable.... but please do let me know if you think a GPU linear solver would be super slow, unstable, and not worth it. Thanks.
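
As an aside (not from the thread): a 6x6 system is small enough that the solve itself is trivial for a single thread, so a GPU version is mostly a question of where the data lives. Below is a sketch of Gaussian elimination with partial pivoting that would port almost line-for-line to a single-thread HLSL compute kernel; the function name and singularity threshold are illustrative.

// Sketch only: solve A*x = b for a fixed 6x6 system with Gaussian elimination
// and partial pivoting. Small enough to run in one thread (CPU, or one lane of
// a compute shader); numerical robustness beyond pivoting is not addressed.
#include <cmath>

bool Solve6x6(float A[6][6], float b[6], float x[6])
{
    for (int col = 0; col < 6; ++col)
    {
        // Pick the row with the largest pivot to reduce numerical error.
        int pivot = col;
        for (int r = col + 1; r < 6; ++r)
            if (std::fabs(A[r][col]) > std::fabs(A[pivot][col])) pivot = r;
        if (std::fabs(A[pivot][col]) < 1e-12f) return false;   // (near-)singular

        // Swap the pivot row into place.
        for (int c = 0; c < 6; ++c) { float t = A[col][c]; A[col][c] = A[pivot][c]; A[pivot][c] = t; }
        float tb = b[col]; b[col] = b[pivot]; b[pivot] = tb;

        // Eliminate the column below the pivot.
        for (int r = col + 1; r < 6; ++r)
        {
            float f = A[r][col] / A[col][col];
            for (int c = col; c < 6; ++c) A[r][c] -= f * A[col][c];
            b[r] -= f * b[col];
        }
    }

    // Back-substitution.
    for (int r = 5; r >= 0; --r)
    {
        float sum = b[r];
        for (int c = r + 1; c < 6; ++c) sum -= A[r][c] * x[c];
        x[r] = sum / A[r][r];
    }
    return true;
}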


Thanks infinisearch, what I'm curious about is how this overhead affects the execution timeline: is this overhead entirely on the CPU side, so it doesn't stall the GPU, and the GPU can still work on previous tasks while the CPU is finishing up sending the cmdlist? If that is the case, I do have tons of spare CPU cycles to spend on such overhead in order to saturate the GPU. But if there are any GPU-CPU sync points in this overhead, I guess I'll probably have to rethink my algorithm.... Thanks

I think the number of command lists is a CPU optimization and the number of ExecuteCommandLists calls is a GPU optimization. For the latter, read towards the end of this thread: https://www.gamedev.net/topic/677701-d3d12-resource-barriers-in-multiple-command-lists/

 

However I doubt this is causing you significant performance problems.

However I doubt this is causing you significant performance problems.

Thanks Infinisearch, could you give me some suggestions on where else I should look? I found the GPU idle time by placing timestamps before and after the dispatch/draw and the related PSO setting, descriptor copying and transition commands within each commandlist. But since I only have one thread generating commandlists, I also use cross-commandlist timestamp pairs (one at the end of the previous cmdlist, and one at the beginning of the current cmdlist), so such a cross-cmdlist timestamp pair can effectively tell me the GPU idle time between the two cmdlists (one thing worth noting is that the Present call may fall between such a timestamp pair, and I have no idea how that affects the timing.....).

Also, I understand it is possible that GPU tasks from other applications may get inserted between my two consecutive cmdlists, so the GPU may not actually be idle during such 'idle time' (please correct me if such a thing is trickier than I thought). But what I noticed is around 5ms of 'GPU idle' time with only the Kinect service (which I believe uses the GPU to perform some work, but definitely not 5ms worth) running in the background, so I guess there must at least be something wrong with the way I generate cmdlists....

 

Please let me know whether it is safe to use cross-cmdlist timestamps to measure GPU idle time (especially when no Present call gets in between), and it would be great if you could list some other things that could possibly cause such GPU idle time.
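
For reference (not code from the thread), a minimal sketch of the cross-cmdlist timestamp measurement being described, using D3D12 timestamp queries; the heap/readback setup and helper names are assumptions:

// Sketch only: write a timestamp at the end of one command list and another at
// the start of the next, resolve both into a readback buffer, and convert the
// delta to milliseconds. device, queue and the two command lists are assumed
// to exist; queryHeap holds 2 timestamp slots, queryReadback is 2 * sizeof(UINT64).
#include <d3d12.h>

ID3D12QueryHeap* queryHeap     = nullptr;
ID3D12Resource*  queryReadback = nullptr;   // buffer on a readback heap

void RecordTimestamps(ID3D12GraphicsCommandList* endOfPrevList,
                      ID3D12GraphicsCommandList* startOfCurrList)
{
    // Slot 0: end of the previous cmdlist. Slot 1: start of the current one.
    endOfPrevList->EndQuery(queryHeap, D3D12_QUERY_TYPE_TIMESTAMP, 0);
    startOfCurrList->EndQuery(queryHeap, D3D12_QUERY_TYPE_TIMESTAMP, 1);

    // Resolve both slots into the readback buffer (recorded in the later list).
    startOfCurrList->ResolveQueryData(queryHeap, D3D12_QUERY_TYPE_TIMESTAMP,
                                      0, 2, queryReadback, 0);
}

double ReadGapMilliseconds(ID3D12CommandQueue* queue)
{
    UINT64 freq = 0;
    queue->GetTimestampFrequency(&freq);    // ticks per second

    UINT64* ticks = nullptr;
    D3D12_RANGE readRange = { 0, 2 * sizeof(UINT64) };
    queryReadback->Map(0, &readRange, reinterpret_cast<void**>(&ticks));
    double ms = double(ticks[1] - ticks[0]) * 1000.0 / double(freq);
    queryReadback->Unmap(0, nullptr);
    return ms;   // approximate GPU gap between the two command lists
}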

 

Thanks

