DX12 and threading


Being new to DirectX 12 I am looking for examples on how to use threading. I have done lots of OpenGL in the past and some DirectX, but with DX12 the threading magic is gone and I understand that threading is crucial to get good performance. In my project I currently have one thread doing it all. I have one command list, one command allocator, one bundle and one bundle allocator. I also have a million triangles, so it's about time that I start doing this.

How do I split things up? How many threads should I use? How many command lists and allocators?

I realize this is a beginner's question, but I have to begin somewhere. I would be grateful if someone could point me in a direction where I could find a simple code sample, tutorial or something similar. Thanks!

5 hours ago, lubbe75 said:

I also have a million triangles, so it's about time that I start doing this.

Number of triangles is irrelevant to the CPU - how many draw calls do you have? If it's thousands, you may get some benefit from using multiple threads to record the draw commands. In my experience, with less than around a thousand draws, there's not much benefit in threaded draw submission. 

5 hours ago, lubbe75 said:

How many threads should I use?

Most engines these days make a pool of one thread per CPU core, and then split all of their workloads up amongst that pool. So on a quad core, I'd use a max of 4 threads, and as above, also no more than around (draws/1000)+1 threads. 

We have a work-stealing task scheduler that spawns 1 thread for every core on the CPU (minus 1 for the main thread). Then we create a bunch of tasks for groups of draw calls, and throw them at the task scheduler. We've tried both 1 thread per logical core (Intel CPUs with hyperthreading have 2 logical cores for every physical core) as well as 1 thread per physical core, and we've generally found that running our task scheduler threads on both logical cores of a physical core is somewhat counterproductive. But your mileage may vary. AMD has some code here that can show you how to query the relevant CPU information.
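As a rough illustration of the "pool of one thread per core, minus one for the main thread" idea, here's a minimal sketch of a plain shared-queue thread pool. It is not a work-stealing scheduler like the one described above, and the class and function names are just placeholders.


#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class SimpleThreadPool
{
public:
    SimpleThreadPool()
    {
        // hardware_concurrency() reports logical processors; query physical cores instead
        // (e.g. with the AMD sample mentioned above) if you want one thread per physical core.
        unsigned hw = std::thread::hardware_concurrency();
        unsigned numWorkers = hw > 1 ? hw - 1 : 1;
        for (unsigned i = 0; i < numWorkers; ++i)
            workers.emplace_back([this] { WorkerLoop(); });
    }

    ~SimpleThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex);
            quit = true;
        }
        wake.notify_all();
        for (std::thread& t : workers)
            t.join();
    }

    void Enqueue(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex);
            tasks.push(std::move(task));
        }
        wake.notify_one();
    }

private:
    void WorkerLoop()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex);
                wake.wait(lock, [this] { return quit || !tasks.empty(); });
                if (quit && tasks.empty())
                    return;
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();   // run the task outside the lock
        }
    }

    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex mutex;
    std::condition_variable wake;
    bool quit = false;
};

You would then hand groups of draw-call recording work to Enqueue and wait for completion with whatever synchronization fits your frame loop.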

Writing your own task scheduler can be quite a bit of work (especially fixing all of the bugs!), but it can also be very educational. There's a pretty good series of articles here that can get you started. There are also third-party libraries like Intel's Threading Building Blocks (which is very comprehensive, but also a bit complex and very heavyweight), or Doug Binks' enkiTS (which is simple and lightweight, but doesn't have fancier high-level features). Windows also has a built-in thread pool API, but I've never used it myself so I can't really vouch for its effectiveness in a game engine scenario.

My general advice for getting started with multithreaded programming is to carefully plan out which data will be touched by each separate task. IMO the easiest (and fastest!) way to have multiple threads work effectively is to make sure that they never touch the same data, or at least do so as infrequently as possible. If you have lots of shared state, things can get messy, slow, and error-prone very quickly once you have to manually wrap everything in critical sections. Also keep in mind that *reading* data from multiple threads is generally fine, and it's *writing* to the same data that usually gets you in trouble. So it can help to figure out exactly which data is immutable during a particular phase of execution, and perhaps also enforce that through judicious use of the "const" keyword.
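As a tiny example of that principle (the function and the per-element work are made up), each thread below reads the shared, immutable input and writes only its own slice of the output, so the parallel part needs no locks at all:


#include <algorithm>
#include <thread>
#include <vector>

// numThreads is assumed to be >= 1.
void TransformAll(const std::vector<float>& input, std::vector<float>& output, unsigned numThreads)
{
    output.resize(input.size());
    const size_t chunk = (input.size() + numThreads - 1) / numThreads;

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numThreads; ++t)
    {
        const size_t begin = t * chunk;
        const size_t end = std::min(begin + chunk, input.size());
        threads.emplace_back([&input, &output, begin, end]
        {
            // Reading 'input' from several threads is fine; only output[begin..end) is written here.
            for (size_t i = begin; i < end; ++i)
                output[i] = input[i] * 2.0f;   // stand-in for real per-element work
        });
    }

    for (std::thread& t : threads)
        t.join();
}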

Thanks for the tips and the links! 

After reading a bit more I get the idea that threading is mainly for recording command lists. Is this correct? Would this also include executing command lists?

Before adding threads, will I gain anything from using multiple command lists, command allocators or command queues?

I have read somewhere that using multiple command allocators can increase performance since I may not have to wait as often before recording the next frame. I guess it's a matter of experimenting with the number of allocators that would be needed in my case.

Would using multiple command lists or multiple command queues have the same effect as using multiple allocators, or will this only make sense with multi-threading? 

I'm currently in a stage where my Dx9 renderer is about 20 times faster than my Dx12 renderer, so I'm guessing it's mainly multi-threading that is missing. Do you know any other obvious and common beginner mistakes when starting with Dx12?

 

Before messing around with threading, 1 thing you'll want to do is make sure that the CPU and GPU are working in parallel. When starting out with DX12, you'll probably have things set up like this:

Record command list for frame 0 -> submit command list for frame 0 -> wait for GPU to process frame 0 (by waiting on a fence) -> record command list for frame 1

If you do it this way the GPU will be idle while the CPU is doing work, and the CPU will be idle while the GPU is doing work. To make sure that the CPU and GPU are pipelined (both working at the same time), you need to do it like this:

Record command list for frame 0 -> submit command list for frame 0 -> record command list for frame 1 -> submit command list for frame 1 -> wait for the GPU to finish frame 0 -> record command list for frame 2

With this setup the GPU will effectively be a frame behind the CPU, but your overall throughput (framerate) will be higher since the CPU and GPU will be working concurrently instead of in lockstep. The big catch is that since the CPU is preparing the next frame while the GPU is actively processing commands, you need to be careful not to modify things that the GPU is reading from. This is where the "multiple command allocators" thing comes in: if you switch back and forth between two allocators, you'll always be modifying one command allocator while the GPU is reading from the other one. The same concept applies to things like constant buffers that are written to by the CPU.
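To make the "don't write what the GPU might still be reading" part concrete, here's a rough sketch of double-buffering a CPU-written constant buffer. This isn't code from the thread: device, currentCPUFrame and constantsForThisFrame are assumed to exist, and the CD3DX12 helpers come from d3dx12.h.


// One small upload-heap constant buffer per frame in flight, so the CPU never writes
// a buffer the GPU may still be reading from.
static const uint64_t RenderLatency = 2;
ID3D12Resource* perFrameCB[RenderLatency] = {};

// Creation (once). Constant buffer views must be 256-byte aligned, hence the size.
for (uint64_t i = 0; i < RenderLatency; ++i)
{
    CD3DX12_HEAP_PROPERTIES heapProps(D3D12_HEAP_TYPE_UPLOAD);
    CD3DX12_RESOURCE_DESC bufferDesc = CD3DX12_RESOURCE_DESC::Buffer(256);
    device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE, &bufferDesc,
                                    D3D12_RESOURCE_STATE_GENERIC_READ, nullptr,
                                    IID_PPV_ARGS(&perFrameCB[i]));
}

// Per frame: write only the buffer that belongs to the current CPU frame.
const uint64_t frameIdx = currentCPUFrame % RenderLatency;
void* mapped = nullptr;
D3D12_RANGE emptyRange = { 0, 0 };   // we won't read this memory from the CPU
perFrameCB[frameIdx]->Map(0, &emptyRange, &mapped);
memcpy(mapped, &constantsForThisFrame, sizeof(constantsForThisFrame));
perFrameCB[frameIdx]->Unmap(0, nullptr);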

Once you've got that working, you can look into splitting things up into multiple command lists that are recorded by multiple threads. Without multiple threads there's no reason to have more than 1 command list unless you're also submitting to multiple queues. Multi-queue is quite complicated, and is definitely an advanced topic. COPY queues are generally useful for initializing resources like textures. COMPUTE queues can be useful for GPUs that support processing compute commands concurrently alongside graphics commands, which can result in higher overall throughput in certain scenarios. They can also be useful for cases where the compute work is completely independent of your graphics work, and therefore doesn't need to be synchronized with your graphics commands.
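For what it's worth, setting up a dedicated COPY queue for uploads only takes a few calls. This is just a sketch; device, gfxQueue, uploadFence and uploadFenceValue are assumed to already exist.


D3D12_COMMAND_QUEUE_DESC copyQueueDesc = {};
copyQueueDesc.Type = D3D12_COMMAND_LIST_TYPE_COPY;

ID3D12CommandQueue* copyQueue = nullptr;
ID3D12CommandAllocator* copyAllocator = nullptr;
ID3D12GraphicsCommandList* copyList = nullptr;

device->CreateCommandQueue(&copyQueueDesc, IID_PPV_ARGS(&copyQueue));
device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_COPY, IID_PPV_ARGS(&copyAllocator));
device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_COPY, copyAllocator, nullptr, IID_PPV_ARGS(&copyList));

// ... record CopyBufferRegion / CopyTextureRegion calls on copyList here ...

copyList->Close();
ID3D12CommandList* uploadLists[] = { copyList };
copyQueue->ExecuteCommandLists(1, uploadLists);

// Make the graphics queue wait until the uploads are finished before it uses the resources.
copyQueue->Signal(uploadFence, ++uploadFenceValue);
gfxQueue->Wait(uploadFence, uploadFenceValue);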

On 12/8/2017 at 5:13 AM, lubbe75 said:

After reading a bit more I get the idea that threading is mainly for recording command lists. Is this correct? Would this also include executing command lists?

Before adding threads, will I gain anything from using multiple command lists, command allocators or command queues?

Read through this document; it should answer your questions.

https://developer.nvidia.com/sites/default/files/akamai/gameworks/blog/GDC16/GDC16_gthomas_adunn_Practical_DX12.pdf

-potential energy is easily made kinetic-

Thanks for that link, Infinisearch!

MJP, I have tried what you suggested, but I got poorer results compared to the straight forward 1-allocator method. Here is what I tried:

After initializing, setting frameIndex to 0, and resetting the commandList with allocator 0, I run the following loop (pseudo-code):


populate commandList;
execute commandList;
reset commandList (using allocator[frameIndex]);
present the frame;
frameIndex = swapChain.CurrentBackBufferIndex; // 0 -> 1, 1 -> 0
 
if (frameIndex == 1) 
{
    // set the fence after frame 0, 2, 4, 6, 8, ...
    commandQueue.Signal(fence, fenceValue);
}
else
{
    // wait for the fence after frame 1, 3, 5, 7, 9, ...
    int currentFence = fenceValue;
    fenceValue++;
    if (fence.CompletedValue < currentFence)
    {
        fence.SetEventOnCompletion(currentFence, fenceEvent.SafeWaitHandle.DangerousGetHandle());
        fenceEvent.WaitOne();
    }
}

Have I understood the idea correctly (I think I have)? Perhaps something here gets done in the wrong order?

 

 

That's not quite what I meant. You'll still want to signal your fence and wait on it every frame, you just need to wait on the value one frame later. The first frame you don't need to wait because there was no "previous" frame, but you do need to wait for every frame after that. Here's what my code looks like, minus a few things that aren't relevant:


void EndFrame(IDXGISwapChain4* swapChain, uint32 syncIntervals)
{
    DXCall(CmdList->Close());

    ID3D12CommandList* commandLists[] = { CmdList };
    GfxQueue->ExecuteCommandLists(ArraySize_(commandLists), commandLists);

    // Present the frame.
    DXCall(swapChain->Present(syncIntervals, syncIntervals == 0 ? DXGI_PRESENT_ALLOW_TEARING : 0));

    ++CurrentCPUFrame;

    // Signal the fence with the current frame number, so that we can check back on it
    FrameFence.Signal(GfxQueue, CurrentCPUFrame);

    // Wait for the GPU to catch up before we stomp an executing command buffer
    const uint64 gpuLag = DX12::CurrentCPUFrame - DX12::CurrentGPUFrame;
    Assert_(gpuLag <= DX12::RenderLatency);
    if(gpuLag >= DX12::RenderLatency)
    {
        // Make sure that the previous frame is finished
        FrameFence.Wait(DX12::CurrentGPUFrame + 1);
        ++DX12::CurrentGPUFrame;
    }

    CurrFrameIdx = DX12::CurrentCPUFrame % NumCmdAllocators;

    // Prepare the command buffers to be used for the next frame
    DXCall(CmdAllocators[CurrFrameIdx]->Reset());
    DXCall(CmdList->Reset(CmdAllocators[CurrFrameIdx], nullptr));
}

 

13 hours ago, MJP said:

That's not quite what I meant. You'll still want to signal your fence and wait on it every frame, you just need to wait on the value one frame later. The first frame you don't need to wait because there was no "previous" frame, but you do need to wait for every frame after that. Here's what my code looks like, minus a few things that aren't relevant:

MJP, I didn't look at the linked code, but do you do anything for frame pacing in the full code? I see that gamers on the internet complain about frame pacing quite a lot when they seem to perceive issues with it. Your code snippet above would render a certain number of frames on the CPU as fast as possible and then wait for the GPU to catch up. Wouldn't this lead to jerkiness in the input sampling and simulation? Would you just add some timer code to the above to delay the next iteration of the game loop if necessary? Or is it more complex?

-potential energy is easily made kinetic-

The code that I posted will let the CPU get no more than 1 frame ahead of the GPU. After the CPU submits command lists to the direct queue, it waits for the previous GPU frame to finish. So if the GPU is taking more time to complete a frame than the CPU is (or if VSYNC is enabled), the CPU will be effectively throttled by the fence and will stay tied to the GPU's effective framerate.

In my experience, frame pacing issues usually come from situations where the time delta being used for updating the game's simulation doesn't match the rate at which frames are actually presented on the screen. This can happen very easily if you use the length of the previous frame as your delta for the next frame. When you do this, you're basically saying "I expect the next frame to take just as long to update and render as the previous frame". This assumption will hold when you're locked at a steady framerate (usually due to VSYNC), but if your framerate is erratic then you will likely have mismatches between your simulation time delta and the actual frame time. It can be especially bad when missing VSYNC, since your frame times may go from 16.6ms up to 33.3ms, and perhaps oscillate back and forth.

I would probably suggest the following for mitigating this issue:

  1. Enable VSYNC, and never miss a frame! This will give you 100% smooth results, but obviously it's much easier said than done.
  2. Detect when you're not making VSYNC, and increase the sync interval to 2. This will effectively halve your framerate (for instance, you'll go from 60Hz to 30Hz on a 60Hz display), but that may be preferable to "mostly" making full framerate with frequent dips.
  3. Alternatively, disable VSYNC when you're not quite making it. This is common on consoles, where you have the ability to do this much better than you do on PC. It's good for when you're just barely missing your VSYNC rate, since in that case most of the screen will still get updated at full rate (however there will be a horizontal tear line). It will also keep you from dropping to half the VSYNC rate, which will reduce the error in your time delta assumption.
  4. Triple buffering can also give you similar results to disabling VSYNC, but also prevent tearing (note that non-fullscreen D3D apps on Windows are effectively triple-buffered by default since they go through the desktop compositor)
  5. You could also try filtering your time deltas a bit to keep them from getting too erratic when you don't make VSYNC. I've never tried this myself, but it's possible that having more consistent but smaller errors in your time delta is better than less frequent but larger ones (a rough sketch of this follows below).
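
Here's a minimal sketch of what that kind of delta filtering could look like: a simple moving average over the last few raw frame deltas. The window size is arbitrary, and the function name is just a placeholder.


#include <algorithm>

float SmoothDelta(float rawDelta)
{
    static const int WindowSize = 8;
    static float history[WindowSize] = {};
    static int cursor = 0;
    static int filled = 0;

    // Store the newest raw delta in a small ring buffer.
    history[cursor] = rawDelta;
    cursor = (cursor + 1) % WindowSize;
    filled = std::min(filled + 1, WindowSize);

    // Return the average of the deltas collected so far.
    float sum = 0.0f;
    for (int i = 0; i < filled; ++i)
        sum += history[i];
    return sum / filled;
}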

Hopefully someone else can chime in with more thoughts if they have experience with this. I haven't really done any specific research or experimentation with this issue outside of making games feel good when they ship, so don't consider me an authority on this issue. :)
