MJP


About MJP

  • Rank
    XNA/DirectX Moderator & MVP

Personal Information

Social

  • Twitter
    @MyNameIsMJP
  • Github
    TheRealMJP
  1. DX12 DX12 and threading

    The code that I posted will let the CPU get no more than 1 frame ahead of the GPU. After the CPU submits command lists to the direct queue, it waits for the previous GPU frame to finish. So if the GPU is taking more time to complete a frame than the CPU is (or if VSYNC is enabled), the CPU will be effectively throttled by the fence and will stay tied to the GPU's effective framerate. In my experience, frame pacing issues usually come from situations where the time delta being used for updating the game's simulation doesn't match the rate at which frames are actually presented on the screen. This can happen very easily if you use the length of the previous frame as your delta for the next frame. When you do this, you're basically saying "I expect the next frame to take just as long to update and render as the previous frame". This assumption holds when you're locked to a steady framerate (usually due to VSYNC), but if your framerate is erratic then you will likely have mismatches between your simulation time delta and the actual frame time. It can be especially bad when missing VSYNC, since your frame times may go from 16.6ms up to 33.3ms, and perhaps oscillate back and forth. I would suggest the following for mitigating this issue:
    • Enable VSYNC, and never miss a frame! This will give you 100% smooth results, but obviously it's much easier said than done.
    • Detect when you're not making VSYNC, and increase the sync interval to 2. This will effectively halve your framerate (for instance, you'll go from 60Hz to 30Hz on a 60Hz display), but that may be preferable to "mostly" making full framerate with frequent dips.
    • Alternatively, disable VSYNC when you're not quite making it. This is common on consoles, where you have the ability to do this much better than you do on PC. It's good for when you're just barely missing your VSYNC rate, since in that case most of the screen will still get updated at full rate (although there will be a horizontal tear line). It will also keep you from dropping to half the VSYNC rate, which will reduce the error in your time delta assumption. Triple buffering can give you similar results to disabling VSYNC while also preventing tearing (note that non-fullscreen D3D apps on Windows are effectively triple-buffered by default, since they go through the desktop compositor).
    • You could also try filtering your time deltas a bit to keep them from getting too erratic when you don't make VSYNC (see the sketch below). I've never tried this myself, but it's possible that having more consistent but smaller errors in your time delta is better than less frequent but larger ones.
    Hopefully someone else can chime in with more thoughts if they have experience with this. I haven't really done any specific research or experimentation with this issue outside of making games feel good when they ship, so don't consider me an authority on it.
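    To illustrate that last point, here's a rough sketch of one way you might smooth the time delta with a simple moving average. The FrameTimeFilter name and the window size are made up for illustration, so treat this as a starting point rather than a recommendation:

    // Averages the last N raw frame times to produce a smoother delta
    // for the simulation update. Assumes deltas are in seconds.
    #include <array>
    #include <cstddef>

    class FrameTimeFilter
    {
    public:
        float Filter(float rawDelta)
        {
            history[nextIndex] = rawDelta;
            nextIndex = (nextIndex + 1) % history.size();
            if(numSamples < history.size())
                ++numSamples;

            float sum = 0.0f;
            for(size_t i = 0; i < numSamples; ++i)
                sum += history[i];
            return sum / float(numSamples);
        }

    private:
        std::array<float, 8> history = { };
        size_t nextIndex = 0;
        size_t numSamples = 0;
    };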
  2. That sounds like a bug in Nvidia's driver. In D3D11 the results of UAV writes should be visible to all other pipeline stages after the Dispatch completes, regardless of flags and regardless of whether or not you've used atomic operations.
  3. For The Order our compositing system was totally based on parameter blending, much in the same spirit as Disney's material system that was presented at SIGGRAPH 2012. The only exception was when compositing cloth-based materials, in which case we would evaluate lighting with both our cloth BRDF as well as with GGX specular, and then perform a simple linear blend of the two results. As far as I know UE4 is generally doing parameter blending as well, but I've never worked with that engine so I'm not very familiar with the specifics. As you've already figured out, you can't really simulate true multi-layer materials with parameter blending alone. To do it "for real", you have to make some attempt at modeling the light transport through the various translucent layers (much like they do in that COD presentation). This generally requires some approximation of volume rendering so that you can compute the amount of light absorbed as it travels through the medium. For something like car paint, at the minimum you'll need to compute your specular twice: once for the light being reflected off of the clear coat layer, and again for the light being reflected off of the actual metallic paint flecks. I'd probably start out with something like this:
    • Compute the specular reflection off of the clear coat using the roughness and IOR (F0 specular intensity)
    • Compute the amount of light transmitted (refracted) into the clear coat using the Fresnel equations and IOR
    • Compute the intensity and direction of the view direction as it refracts into the clear coat using the Fresnel equations and IOR
    • Compute the specular reflection off of the metallic layer using a separate roughness and IOR (and perhaps a separate normal map), and also using the refracted view direction
    • Final result is ClearCoatSpecular + MetallicSpecular * LightTransmitAmt * ViewTransmitAmt
    This would not account for absorption in the clear coat, but it would give you the characteristic dual specular lobes. If you wanted to account for absorption, you could compute a travel distance through the clear coat by having a "thickness" parameter and computing the intersection of the light/view ray with an imaginary surface located parallel to the outer surface. Using that distance you could evaluate the Beer-Lambert equation and use the result to modulate your transmittance values.
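    To spell out the absorption part a little more: with an absorption coefficient $\sigma_t$ for the clear coat medium and a travel distance $d$ through the coat (both symbols are just for illustration), the Beer-Lambert transmittance is $T = e^{-\sigma_t d}$, so the combined result from the steps above would look roughly like

    $$ f_{\mathrm{spec}} = f_{\mathrm{clearcoat}} + f_{\mathrm{metal}} \cdot T_{\mathrm{light}} \cdot T_{\mathrm{view}} \cdot e^{-\sigma_t d} $$

    where $T_{\mathrm{light}}$ and $T_{\mathrm{view}}$ are the Fresnel transmittance terms for the light and view directions.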
  4. DX12 DX12 and threading

    That's not quite what I meant. You'll still want to signal your fence and wait on it every frame, you just need to wait on the value one frame later. The first frame you don't need to wait because there was no "previous" frame, but you do need to wait for every frame after that. Here's what my code looks like, minus a few things that aren't relevant:

    void EndFrame(IDXGISwapChain4* swapChain, uint32 syncIntervals)
    {
        DXCall(CmdList->Close());

        ID3D12CommandList* commandLists[] = { CmdList };
        GfxQueue->ExecuteCommandLists(ArraySize_(commandLists), commandLists);

        // Present the frame.
        DXCall(swapChain->Present(syncIntervals, syncIntervals == 0 ? DXGI_PRESENT_ALLOW_TEARING : 0));

        ++CurrentCPUFrame;

        // Signal the fence with the current frame number, so that we can check back on it
        FrameFence.Signal(GfxQueue, CurrentCPUFrame);

        // Wait for the GPU to catch up before we stomp an executing command buffer
        const uint64 gpuLag = DX12::CurrentCPUFrame - DX12::CurrentGPUFrame;
        Assert_(gpuLag <= DX12::RenderLatency);
        if(gpuLag >= DX12::RenderLatency)
        {
            // Make sure that the previous frame is finished
            FrameFence.Wait(DX12::CurrentGPUFrame + 1);
            ++DX12::CurrentGPUFrame;
        }

        CurrFrameIdx = DX12::CurrentCPUFrame % NumCmdAllocators;

        // Prepare the command buffers to be used for the next frame
        DXCall(CmdAllocators[CurrFrameIdx]->Reset());
        DXCall(CmdList->Reset(CmdAllocators[CurrFrameIdx], nullptr));
    }
  5. DX12 DX12 and threading

    Before messing around with threading, one thing you'll want to do is make sure that the CPU and GPU are working in parallel. When starting out with DX12, you'll probably have things set up like this:
      Record command list for frame 0 -> submit command list for frame 0 -> wait for the GPU to process frame 0 (by waiting on a fence) -> record command list for frame 1
    If you do it this way the GPU will be idle while the CPU is doing work, and the CPU will be idle while the GPU is doing work. To make sure that the CPU and GPU are pipelined (both working at the same time), you need to do it like this:
      Record command list for frame 0 -> submit command list for frame 0 -> record command list for frame 1 -> submit command list for frame 1 -> wait for the GPU to finish frame 0 -> record command list for frame 2
    With this setup the GPU will effectively be a frame behind the CPU, but your overall throughput (framerate) will be higher since the CPU and GPU will be working concurrently instead of in lockstep. The big catch is that since the CPU is preparing the next frame while the GPU is actively processing commands, you need to be careful not to modify things that the GPU is reading from. This is where the "multiple command allocators" thing comes in: if you switch back and forth between two allocators, you'll always be modifying one command allocator while the GPU is reading from the other one (see the sketch below). The same concept applies to things like constant buffers that are written to by the CPU. Once you've got that working, you can look into splitting things up into multiple command lists that are recorded by multiple threads. Without multiple threads there's no reason to have more than one command list unless you're also submitting to multiple queues. Multi-queue is quite complicated, and is definitely an advanced topic. COPY queues are generally useful for initializing resources like textures. COMPUTE queues can be useful for GPUs that support concurrently processing compute commands alongside graphics commands, which can result in higher overall throughput in certain scenarios. They can also be useful for cases where the compute work is completely independent of your graphics work, and therefore doesn't need to be synchronized with your graphics commands.
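    To make the allocator juggling concrete, here's a stripped-down sketch of what the pipelined frame loop can look like with two command allocators. All of the names here (CmdAllocators, CmdList, GfxQueue, SwapChain, frameFence, WaitForFenceValue, RecordFrame) are placeholders rather than code from an actual framework, and error handling is omitted:

    // Two allocators: while the GPU consumes commands recorded with one,
    // the CPU records the next frame with the other.
    const uint64 RenderLatency = 2;     // CPU can get at most 1 frame ahead of the GPU

    for(uint64 cpuFrame = 0; ; ++cpuFrame)
    {
        const uint64 allocatorIdx = cpuFrame % RenderLatency;

        // Safe to reset: the last frame that used this allocator has already been waited on
        // (assumes CmdList was closed right after creation, before the first iteration).
        CmdAllocators[allocatorIdx]->Reset();
        CmdList->Reset(CmdAllocators[allocatorIdx], nullptr);

        RecordFrame(CmdList);           // record this frame's draws/dispatches
        CmdList->Close();

        ID3D12CommandList* lists[] = { CmdList };
        GfxQueue->ExecuteCommandLists(1, lists);
        SwapChain->Present(1, 0);

        // Signal with this frame's number, then make sure the GPU is no more than
        // (RenderLatency - 1) frames behind before the CPU starts on the next frame.
        GfxQueue->Signal(frameFence, cpuFrame + 1);
        if(cpuFrame + 1 >= RenderLatency)
            WaitForFenceValue(frameFence, cpuFrame + 2 - RenderLatency);
    }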
  6. It looks like you're binding your sampler state to the Domain Shader stage, but I don't see any code for binding the texture's shader resource view (SRV) to the DS stage. Are you doing that somewhere? If not, that would explain why you're getting all black as the result. BTW, the reason that you can't use Sample() in a non-pixel shader is because that function automatically computes the mip level for you. The way it does this is by looking at the neighboring pixels and figuring out how much the UV coordinates change from one pixel to the next (effectively computing the partial derivatives of U and V with respect to screen-space X and Y), and then using that value to compute the appropriate mip level as well as the amount of anisotropic filtering (if enabled). This doesn't work in other shader types because there's no grid of pixels, so you have to manually specify the mip level or provide the partial derivatives.
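    In case it helps, binding an SRV to the domain shader stage in D3D11 looks something like this. The variable names are just placeholders for your own texture and sampler:

    void BindHeightMapToDS(ID3D11DeviceContext* context,
                           ID3D11ShaderResourceView* heightMapSRV,
                           ID3D11SamplerState* linearSampler)
    {
        // Bind the SRV and sampler to the domain shader stage, matching the
        // register slots (t0/s0 here) that the shader declares.
        ID3D11ShaderResourceView* srvs[] = { heightMapSRV };
        context->DSSetShaderResources(0, 1, srvs);

        ID3D11SamplerState* samplers[] = { linearSampler };
        context->DSSetSamplers(0, 1, samplers);
    }

    Inside the domain shader you'd then use SampleLevel (or Load) instead of Sample, since there are no screen-space derivatives available for picking a mip level.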
  7. I don't have any comprehensive numbers at the moment, so I'll have to try to set up a benchmark at some point. I would guess that the difference would be pretty minimal unless you're uploading a very large buffer. For me it was also somewhat convenient to use the COPY queue, since I already had a system in place for initializing resources using the COPY queue, and the buffer updates go through the same system. The IHVs have recommended using the COPY queue for resource initialization, since the DMA units are optimized for pulling lots of data over the PCI-e bus without disrupting rendering too much (which is necessary in D3D11 games that stream in new textures while gameplay is going on).
  8. There's some discussion on this very same topic going on in this thread, so you should check that out. Broadly speaking, there are two different ways of implementing dynamic buffers that can be updated by the CPU:
    • Create the buffer in an UPLOAD heap, and directly write to it from the CPU using the pointer that you get from Map.
    • Create the buffer in a DEFAULT heap, and update it by first having the CPU write the data to a temporary buffer in an UPLOAD heap, and then kicking off a GPU copy operation to copy from the UPLOAD resource to the DEFAULT resource.
    The first one is simpler, and is the easier one to start with. You just need to make sure that you don't write to a buffer that the GPU is currently reading from, which is most easily achieved by having N buffers where N is the number of frames in flight (usually 2). The second one is more useful for cases where the GPU is going to be reading the data multiple times, and/or the buffer is very big and will need maximum bandwidth. You do not want to use READBACK for this; that's intended for having the CPU read data that was written by the GPU (as in having the CPU "read back" the data from the GPU).
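    Here's a rough sketch of the first approach, with per-frame versioning so the CPU never stomps data the GPU might still be reading. The constants and variable names are made up for illustration, and the usual d3d12.h/<cstring> headers are assumed:

    // One large UPLOAD-heap buffer, sliced into NumFramesInFlight regions.
    // Each frame the CPU writes into the region for the current frame index.
    const uint32 NumFramesInFlight = 2;
    const uint64 BufferSize = 64 * 1024;   // size of one frame's slice

    // Assumes 'buffer' was created in an UPLOAD heap with size BufferSize * NumFramesInFlight,
    // and that a fence guarantees the GPU is done with data from NumFramesInFlight frames ago.
    void UpdateDynamicBuffer(ID3D12Resource* buffer, uint64 frameIndex, const void* srcData, uint64 srcSize)
    {
        const uint64 offset = (frameIndex % NumFramesInFlight) * BufferSize;

        // UPLOAD resources can stay persistently mapped; the empty read range
        // tells the API that the CPU won't read the data back.
        uint8* mapped = nullptr;
        D3D12_RANGE readRange = { 0, 0 };
        buffer->Map(0, &readRange, reinterpret_cast<void**>(&mapped));
        memcpy(mapped + offset, srcData, srcSize);
        buffer->Unmap(0, nullptr);

        // Bind using a GPU virtual address offset into this frame's slice, e.g.:
        // cmdList->SetGraphicsRootConstantBufferView(rootParamIdx, buffer->GetGPUVirtualAddress() + offset);
    }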
  9. DX12 DX12 and threading

    We have a work-stealing task scheduler that spawns 1 thread for every core on the CPU (minus 1 for the main thread). Then we create a bunch of tasks for groups of draw calls, and throw them at the task scheduler. We've tried both 1 thread per logical core (Intel CPUs with hyperthreading have 2 logical cores for every physical core) as well as 1 thread per physical core, and we've generally found running our task scheduler threads on both logical cores to be somewhat counterproductive. But your mileage may vary. AMD has some code here that can show you how to query the relevant CPU information. Writing your own task scheduler can be quite a bit of work (especially fixing all of the bugs!), but it can also be very educational. There's a pretty good series of articles here that can get you started. There are also third-party libraries like Intel's Thread Building Blocks (which is very comprehensive, but also a bit complex and very heavyweight), or Doug Binks's enkiTS (which is simple and lightweight, but doesn't have fancier high-level features). Windows also has a built-in thread pool API, but I've never used it myself so I can't really vouch for its effectiveness in a game engine scenario. My general advice for starting on multithreaded programming is to carefully plan out which data will be touched by each separate task. IMO the easiest (and fastest!) way to have multiple threads work effectively is to make sure that they never touch the same data, or at least do so as infrequently as possible. If you have lots of shared things, it can get messy, slow, and error-prone very quickly if you have to manually wrap things in critical sections. Also keep in mind that *reading* data from multiple threads is generally fine, and it's *writing* to the same data that usually gets you in trouble. So it can help to figure out exactly which data is immutable during a particular phase of execution, and perhaps also enforce that through judicious use of the "const" keyword.
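    As a much simpler illustration of that "no shared writes" idea (this is not our actual work-stealing scheduler, just plain std::thread, and the Draw/CommandList/RecordDraw names are stand-ins for your own types), you can split the draw list into contiguous chunks and give each worker its own chunk and its own command list:

    #include <algorithm>
    #include <cstdint>
    #include <thread>
    #include <vector>

    struct Draw { /* per-draw data: mesh, material, constants, etc. */ };
    struct CommandList { /* stand-in for your command list / recording context */ };

    // Placeholder for whatever records a single draw into a command list.
    void RecordDraw(CommandList& cmdList, const Draw& draw);

    // Each worker only reads the shared (immutable) draw list and only writes to its
    // own command list, so no synchronization is needed. A real work-stealing scheduler
    // adds load balancing on top, but the data-ownership idea is the same.
    void RecordDrawsThreaded(const std::vector<Draw>& draws,
                             std::vector<CommandList*>& cmdLists,    // one per worker, pre-created
                             uint32_t numWorkers)
    {
        std::vector<std::thread> workers;
        const size_t drawsPerWorker = (draws.size() + numWorkers - 1) / numWorkers;

        for(uint32_t i = 0; i < numWorkers; ++i)
        {
            const size_t begin = i * drawsPerWorker;
            const size_t end = std::min(draws.size(), begin + drawsPerWorker);
            workers.emplace_back([&, i, begin, end]()
            {
                for(size_t d = begin; d < end; ++d)
                    RecordDraw(*cmdLists[i], draws[d]);
            });
        }

        for(std::thread& t : workers)
            t.join();
    }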
  10. For my persistent "dynamic" buffers I like to have a "CPUWritable" flag that lets you have two different behaviors. If that flag is set, the buffer is allocated out of an UPLOAD heap and can be written to directly by the CPU. To make sure that the CPU doesn't overwrite something that the GPU is reading, the buffer is internally double-buffered, and the buffers are swapped when the contents are changed by the CPU. With this setup you can only flip the buffer at most once per frame (where a "frame" is denoted by a fenced submission of multiple command lists to the DIRECT queue, followed by a Present), so I have an assert that tracks the frame in which the buffer was last updated.

    If the CPUWritable flag is false, then the contents have to be updated by writing to temporary UPLOAD memory first, and then copying that to the actual buffer memory in a DEFAULT heap. However, I do it a little differently than you're proposing, since I use a COPY queue to do the copy instead of using a DIRECT queue. Doing it on the copy queue is trickier since you have multi-queue synchronization involved, but the upside is that the copy can potentially start earlier and run alongside other graphics work (which you usually want to do for initializing static resources). To again avoid writing to something that the GPU is reading from, I also double-buffer in this case and only allow at most one update per frame. For the temporary memory from an UPLOAD heap that's used as a staging area, I have a ring buffer that tracks fences to know when it can move the start pointer forward.

    With your approach of doing the copy on the DIRECT queue, the nice part is that it will be synchronized with the graphics work on the GPU timeline. This means that you don't need to double-buffer, or do any synchronization beyond your barriers. But the downside is that the copy will happen synchronously with your graphics work, instead of "hiding" in other work. You'll also have to track your fence on the DIRECT queue to know when to free your chunk from the UPLOAD heap.

    For choosing between whether to keep your buffer in UPLOAD memory or copy into DEFAULT memory, the best choice most likely depends on how you access the data. If the data is small and you're not going to do repeated random accesses to it, UPLOAD is probably fine (this covers a lot of constant buffers). If the data is larger and you access it multiple times, then it's probably worth copying it to DEFAULT so that you get full access speeds on the GPU (something like a StructuredBuffer full of lights for a forward+ renderer would probably fall into this category).

    Anyway, I just wanted to share what I'm doing to give you a few ideas. I'm not claiming to have the best possible approaches here, so feel free to do what works best for you and your engine.

    EDIT: I forgot to add some links to my code for reference. You can find the buffer code here, and the upload queue code here. Just be aware that the descriptor management is a bit complicated since that code uses persistent bindless descriptor indices, so there's some jumping through hoops to make sure that the descriptor index doesn't have to change when the buffer is updated.
  11. If you would like to see a code example, you can look at my Profiler class: https://github.com/TheRealMJP/DeferredTexturing/blob/experimental/SampleFramework12/v1.01/Graphics/Profiler.cpp
  12. 3D Help: FPS limit ( no vsync)

    As others have mentioned, this forum is for discussing graphics and GPU programming. If you have a specific question related to programming and development then I would ask you to please provide some more detail about your issue. However, if you're just having an issue running a particular game then I will have to lock this thread.
  13. It's mentioned in the docs for CreateUnorderedAccessView: "At least one of pResource or pDesc must be provided. A null pResource is used to initialize a null descriptor, which guarantees D3D11-like null binding behavior (reading 0s, writes are discarded), but must have a valid pDesc in order to determine the descriptor type." It's also mentioned here in the programming guide. One thing to watch out for is that there's no way to have a NULL descriptor for a UAV or SRV that's bound as a root SRV/UAV parameter. In this case there's really no descriptor at all (you're just passing a GPU pointer to the buffer data), so you forego any bounds checking on reads or writes. Just like raw pointer access on the CPU, reading or writing out of bounds will result in undefined behavior.
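    For reference, creating a null UAV descriptor looks something like this. The format and dimension here are just example values, and uavHandle is assumed to point at a slot in one of your descriptor heaps:

    // A null resource with a valid desc gives D3D11-style null binding behavior:
    // reads return 0, writes are discarded.
    D3D12_UNORDERED_ACCESS_VIEW_DESC uavDesc = { };
    uavDesc.Format = DXGI_FORMAT_R32G32B32A32_FLOAT;
    uavDesc.ViewDimension = D3D12_UAV_DIMENSION_TEXTURE2D;

    device->CreateUnorderedAccessView(nullptr, nullptr, &uavDesc, uavHandle);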
  14. I just wanted to confirm that it's basically the same in D3D12 as it is in D3D11: you just need to specify which array slice the SRV will read from by filling out the "Texture2DArray.FirstArraySlice" and "Texture2DArray.ArraySize" members of the D3D12_SHADER_RESOURCE_VIEW_DESC structure. You can also do this to make a view into a sub-array if you want, or to make a view that only sees a subset of the mip levels.
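    For example, filling out the desc so the SRV only sees a single slice of a Texture2D array looks roughly like this (the format, slice index, and resource/handle names are placeholders):

    // View a single array slice (sliceIdx) and all of its mip levels.
    D3D12_SHADER_RESOURCE_VIEW_DESC srvDesc = { };
    srvDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    srvDesc.ViewDimension = D3D12_SRV_DIMENSION_TEXTURE2DARRAY;
    srvDesc.Shader4ComponentMapping = D3D12_DEFAULT_SHADER_4_COMPONENT_MAPPING;
    srvDesc.Texture2DArray.MostDetailedMip = 0;
    srvDesc.Texture2DArray.MipLevels = UINT(-1);    // all mip levels
    srvDesc.Texture2DArray.FirstArraySlice = sliceIdx;
    srvDesc.Texture2DArray.ArraySize = 1;

    device->CreateShaderResourceView(textureArray, &srvDesc, srvHandle);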
  15. I use code-gen for my own experimental framework. I write a C# file that contains a few classes, where each class represents a group of settings. These settings use custom attributes to add additional UI parameters, like descriptions and min/max values. This file is processed by a small C# app that I call "SettingsCompiler", which invokes the C# compiler to compile the .cs file containing the settings. Once compiled into an assembly, the settings and their attributes are reflected by the settings compiler and pulled into a list. From there, the settings compiler generates C++ code for initializing the settings using my framework's runtime classes. It also generates a matching HLSL file containing the constant buffer layout, as well as C++ code for filling that constant buffer with the current settings values. The whole thing took a while to set up, but it's definitely worth the time savings now that I don't have to manually add and remove UI + cbuffer definitions. I was able to get it integrated nicely into a VS solution so that the SettingsCompiler is invoked as a custom build step right before the C++ code is compiled, so it's pretty transparent. I also rarely mess with that code now that it's working. If you want to have a look at my implementation, it's all up on GitHub under the MIT license. The older version that uses D3D11 and AntTweakBar is here, and the newer version that uses D3D12 and ImGui is here. Even if you go down the code-gen route, I would still recommend getting basic shader reloading to work. It's really nice being able to make quick changes and have them reload without having to restart the app. It's also not too hard to get going with basic timestamps. A lot of people will go down the path of using the Windows file watching APIs for doing this, which certainly works but can be a real PITA to get right. But if you only have a few files to monitor, then you can just brute force it by checking the file timestamps occasionally.
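    As an example of the brute-force timestamp approach, something like this (using std::filesystem; the file list and the ReloadShader callback are placeholders for your own) is usually enough when you only have a handful of shader files:

    #include <filesystem>
    #include <string>
    #include <system_error>
    #include <unordered_map>
    #include <vector>

    namespace fs = std::filesystem;

    // Placeholder for your own shader recompile/reload path.
    void ReloadShader(const std::string& file);

    // Call this once in a while (e.g. once a second from the main loop). Any file whose
    // last-write time has changed since the previous check triggers a reload.
    void CheckForShaderChanges(const std::vector<std::string>& shaderFiles)
    {
        static std::unordered_map<std::string, fs::file_time_type> lastWriteTimes;

        for(const std::string& file : shaderFiles)
        {
            std::error_code ec;
            const fs::file_time_type writeTime = fs::last_write_time(file, ec);
            if(ec)
                continue;   // file missing or temporarily locked by the editor/compiler

            auto it = lastWriteTimes.find(file);
            if(it == lastWriteTimes.end())
                lastWriteTimes[file] = writeTime;
            else if(it->second != writeTime)
            {
                it->second = writeTime;
                ReloadShader(file);
            }
        }
    }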