About backstep

  1. I noticed this thread this morning and it kind of stuck with me all day. I'm curious about my understanding of descriptor heap usage and whether I'm missing something.

     This evening I went back and checked a few of the MS samples because I thought I remembered them using a single descriptor heap, and indeed, so far as I can tell, they use the same descriptor heap for every frame and every command list within those frames. Even likely candidates still stick to a single descriptor heap, such as the nbodygravity sample (separate graphics and async compute command queues/lists), or the multithreading sample with two decent-sized command lists in each of the three worker threads (shadow and main pass in each thread).

     I'm having trouble seeing why you might want to use multiple descriptor heaps. Looking at the ways you could implement multiple heaps:

     Heap-per-frame: With a descriptor heap per back buffer, let's say 2, alternating each frame, your SRV and UAV descriptors will be duplicated while the CBV descriptors will be unique. You can generally split constants into groups by the frequency they're bound: per-draw and per-frame. Really, per-draw constants shouldn't be in a descriptor heap or set through a table, but instead set directly with a root CBV. That only leaves your per-frame constants like camera/light viewproj matrices, viewport size, etc., which are relatively few compared to SRVs/UAVs. It seems you're doubling your total descriptor heap size for virtually no benefit, since with a single heap the only descriptor offsets you need to adjust between frames are your few per-pass constants.

     Heap-per-thread: With a separate heap per thread, unless you duplicate all the SRV/UAV/CBVs across all heaps (identical heaps), you can't process any pass in parallel (break a pass's draw calls into chunks and have each thread build the corresponding command list). You'd need to know ahead of time which thread is processing which draw call in each pass to ensure the corresponding views are available in that thread's descriptor heap. So per-thread descriptor heaps make little sense.

     Heap-per-pass (cmdlist?): I'm guessing giving each render pass its own descriptor heap was the meaning of giving each cmdlist its own heap. With this you could at least process each pass's draw calls across multiple threads, unlike the above. But you'd still have the issue of needless duplication any time two different passes use the same resource views. It also seems like it would add complexity in other ways (e.g. in a render queue, a material would have multiple handles for one of its textures if it can be used in multiple passes?). Admittedly it wouldn't have a downside for passes that only use completely unique resource views unrelated to any other passes, but that seems a very narrow use case.

     The documentation has some detailed guidance on descriptor heap usage in the "switching" and "management" sections here. There's also a note about using too many very large descriptor heaps on this page. I really am curious if I'm missing something.

     Lastly, I realize this will only seem more ignorant, but as an aside on sampler descriptor heaps: when would you ever need 2048 unique samplers? Even with a complex frame, supporting every level of max anisotropy with a selection of filters, address modes, comparisons, and LoD limits would only come to several hundred. Additionally, in some of those cases it would make more sense to use static samplers if a shader always uses the same samplers (also saving a DWORD in the root signature size)?
  2. Perhaps someone more knowledgeable can comment on how correct or optimal this is, but in both D3D11 and now D3D12 I've always used an intermediate texture2D resource for my compute output (bindflags uav and srv usage, accessed as RWTexture2D in compute shader).     That's then transitioned to D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE and copied into the actual back buffer by rendering a fullscreen triangle using the normal pipeline (accessed as Texture2D in pixel shader), with the backbuffer bound as the RTV.  The UI is then rendered into the backbuffer etc.  There's probably a way to do a more direct copy if the intermediate resource and backbuffer share the same dxgi_format, although usually my compute output is still in a HDR format so the pixel shader is required for tone mapping anyway.
  3. backstep

    [D3D12] Texture Upload Heap

    I think I came across a similar issue when trying to use textures in a pre-release version of D3D12 a few months back. You can read the details on page 4 of the D3D12 documentation thread, although it might not be relevant. Assuming you're using VS2015, you can use the graphics debugger to capture a frame and inspect your "texture" default heap committed resource. If it's the same problem then the texture mip levels will be garbled and/or black. The problem was that directly copying subresources from an upload heap buffer to a default heap buffer leaves texture subresources unable to be correctly accessed by the GPU from a default heap, and I believe (though I may be wrong) this is because the GPU expects default heap texture data to be laid out in a hardware specific way. The solution was rather than using generic update/copy/writesubresource methods when copying textures to a default heap (as you would for non-texture buffers), you have to copy texture subresource data using CommandList::CopyTextureRegion. That way the GPU copies the subresource data to the default heap with the layout the hardware expects for texture accesses from GPU-only memory. Hope it helps, though if anyone knows it's changed since, I'd be glad of the updated info also.
  4. There was a similar issue with DXGI 1.3 and Direct3D 11.2, though I can't find the MSDN page about it immediately. Essentially the DXGI format for swapchain buffers is restricted to a few basic formats. The solution is the same for D3D12 as it was for D3D11.2: you leave the swapchain format as DXGI_FORMAT_R8G8B8A8_UNORM, but change the render target view's format to DXGI_FORMAT_R8G8B8A8_UNORM_SRGB. Basically you tell the rendering pipeline to access the non-sRGB buffer as if it were sRGB. Edit: Tracked down the MSDN page that mentions it: DXGI_MODE_DESC. The last two paragraphs note the format restrictions for flip mode swap chains, and mention using an sRGB version of that format in the RTV for gamma-correct rendering with them.
  5.   You should be able to use ID3D12Device->GetResourceAllocationInfo() using the D3D12_RESOURCE_DESC you've filled out for the texture2D resource (the rendering resource) you want to copy into after writing the texture data to an upload buffer.  I tested it out a while back and if I remember rightly it does account for the resource alignment (64KB), subresource alignment (512 byte) and the row pitch alignment (128 byte).  I think you can check it out in the last sample I posted in this thread, that was an update for the API changes in 10069 (just search "getresourcealloc" in textureloader.h).   You can use the result from that GetResourceAllocationInfo call to ensure you have enough space in your upload buffer resource, before you start sending any data to the GPU.
  6. Thanks for the replies. I had a google around today and gained some better understanding, I think. I had a look at how the Nitrous engine works with Mantle for per-draw/frame data (i.e. constants, slide 27), and in turn how Mantle provides CPU access to GPU memory (page 27). Also had a quick look into how a CPU actually accesses mapped PCIe device memory. Back to D3D12: I think there's some misunderstanding caused by the vague descriptions of the L1 and L0 memory pools. First the simpler of the two: the L1 pool is described as "discrete GPU only". It seems this is intended to mean only that L1 can exist solely on a discrete GPU, not that it is the sole memory pool to use the device memory of an available discrete GPU. If you don't assume that implied "L1 = GPU memory, L0 = main memory" absolute distinction for discrete GPUs, then things make a lot more sense. When a discrete GPU is being used, memory allocated from an L0 pool can still be GPU memory. CPU writes to mapped L0 pool memory can go directly across the PCIe bus to the GPU memory, rather than into main memory to then be "fetched" out of it by the GPU using DMA. If L0 and L1 pools are both GPU memory, why make the distinction? Why is L0 slower for GPU read access? I can only guess it's because non-exclusively accessed memory is almost always slower; you can't reliably cache it, etc. So in short, I think perhaps there's a bad assumption about where L0 memory pools for discrete GPUs physically reside.
  7. Are these the slides you saw about the Nitrous engine with D3D12? http://www.slideshare.net/mistercteam/d3-d12-anewmeaningforefficiencyandperformance-46176727 (GDC 2015, AMD/MS/Oxide) I could be totally wrong, but my understanding is as follows. You can never map L1 memory for CPU access (see slide 34); you can only map L0 for CPU write-combine or readback. I'm reasonably certain of this. The section from Oxide Games about their Nitrous engine starts at slide 40, and their choice of language seems confusing, at least to me. On slide 44, when they talk about "GPU memory" being write-combined, I assume they mean L0 memory, so it would be more accurate to say "GPU-accessible memory". On slide 45, when they say "if stored in CPU memory with GPU fetch", what I think they mean is stored in CPU-only main memory and then copied to L0 graphics memory later in the frame (e.g. in D3D11 with a dynamic map right before the draw command, or even in D3D12 by waiting until you're building your command lists to fill in the upload buffer). You can see from their diagram on slide 46 what they mean by "avoiding the extra copy". Their game simulation writes the per-draw data directly to persistently mapped L0 memory, rather than appending it to some CPU-only structure that is then copied to mapped L0 memory later in the frame by the renderer. This saves, in their case, 3GB/s of writing plus another 3GB/s of reading main memory. Again this is just as I understand it, but the bit about _mm_stream_si128 (slide 44 again) is to do with how the CPU normally caches data written to memory in its L2 and L3 data caches (to speed up any following read of it). Since you know the per-draw data sent to the GPU is never going to be read by the CPU again, you can use _mm_stream_si128 rather than memcpy to prevent the written data from being cached by the CPU, hopefully letting some useful data stay in the L2/L3 caches instead. If I do have any of that wrong, then someone let me know, thanks.
Oh, if you're using the original PowerPoint file rather than that web slideshow, you might need to add 1 to each slide number I mentioned (slide 12 is missing).
  8. No problem, we're all learning with the new API. The short version is you have the GPU copy the upload heap resource to a default heap resource. There are various functions for that listed in the "Copying" section of the page I linked before. One thing to quickly mention about that page is that I don't believe the section on mapping resources is necessarily correct (I think it's adapted from the D3D11 docs, since it states mapping a resource denies the GPU access, which isn't really true for D3D12). The details of doing the resource copy from an upload heap resource to a default heap resource are up to you; there are a few ways to approach it. You could create temporary committed upload resources (committed means an individual heap per resource) for each buffer, and release each one once the GPU has copied the data to their respective default resources. Or you could create a more permanent committed upload resource and sub-allocate from it, re-using the upload resource as the copy source used to initialize your default resources. Either way you need to make sure the GPU has completed the copy operation before either releasing the individual committed upload resources or reusing that part of the sub-allocated upload resource. The first option is probably easier to understand coming from D3D11, but it's also a lot of creating/mapping/releasing temporary resources. The second one is a bit more elegant and seems like the recommended approach: you create the upload resource once, map it once, and don't even need to release it straight afterward (i.e. you can reuse it for your constant data once you're done uploading data for default resources). For more info on that second approach using sub-allocation, there's some good information in the preliminary docs: Suballocation Within Heaps. Just don't be a dummy like me and mistake their m_spUploadHeap for an ID3D12Heap when it's actually an ID3D12Resource.
  9. You can't map a default buffer, it's GPU access only, you have to use an upload buffer. So desc.Properties.Type = D3D12_HEAP_TYPE_DEFAULT should be changed to D3D12_HEAP_TYPE_UPLOAD. There's more information here, in the heap types section: https://msdn.microsoft.com/en-us/library/dn899216(v=vs.85).aspx Edit: I should really use the correct nomenclature. What I meant to say is that you can't map a resource that was created on a default heap, you can only map resources created on upload heaps (or readback heaps).
  10. Thanks for posting about the fullscreen changes. The issue with fullscreen using WARP is that the WARP device doesn't own the output device. The debug layer will give an error alongside the _com_error about it: "The swapchain's adapter does not control the output on which the swapchain's window resides.". A trick to get around that is to have a WARP device control the output device, by changing your display adapter in Windows Device Manager to Microsoft Basic Display Adapter. Since the basic display adapter always renders with WARP, you need to create a normal hardware device in your application code like you would for a real display adapter, and ensure WARP isn't enabled in the DirectX Control Panel either. I've attached the fullscreen2 project refactored for the 10069 SDK. I also added a new global variable g_bbCount to make it easy to play around with how the backbuffer count affects framerates. It defaults to 4 backbuffers and using Present(0,0), which gives a 5-6ms frametime for me (I just used Alt-F5 to start with VSGD to see the frametime). If you want to enable DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT you need to set the flag in both the init function and in the resize function at the bottom of main.cpp.
  11. I mentioned that in my last post, but here's the exact method for using the DirectX Control Panel to enable the debug layer. In VS2015 open the Debug menu - Graphics - DirectX Control Panel. In the panel you need to add your executable (e.g. x64\debug\project1.exe), then select the Force On radio button. The checkbox for WARP is at the bottom. When upgrading from 10041 and VS2015 CTP6 to 10074 and VS2015 RC yesterday, the included Windows SDK failed to install with VS2015 RC, which is why I linked the Windows SDK separately in my post. Today I did a clean install of 10074 with VS2015 RC, and the VS2015 RC setup installed the Windows SDK successfully this time, including the 10069 headers for D3D12 along with the sdklayer and WARP dlls. Just thought I'd give a heads up about that; it seems to be a CTP6->RC upgrade issue. I do have a quick question - Alessio1989, you were hinting about changes with fullscreen swapchains in these newer bits a week or so ago. Now they're available, and I've refactored for the 10069 SDK, the one change I've found is that presenting with 0 interval in fullscreen no longer unlocks the framerate as it did with the 10030 SDK. Is that the change you meant? With 10069 both windowed and fullscreen presents behave identically; only increasing the back buffer count allows multiple presents per vertical sync, even in fullscreen now. Edit: Just to contribute to the multi-GPU discussion (though I'm a little skeptical about how many dGPU gamers currently own D3D12-compatible iGPUs), the Build talk from yesterday just got posted: Advanced DirectX12 Graphics and Performance. Although it shares the same name as the one from GDC 2015, the content is totally different and mostly concerns multi-GPU rendering this time. It also seems to indicate the D3D12 API as it is now is finalized, so that's good news.
  12. Thanks for the link to the upcoming Visual Studio Tools, with any luck they post them this week during/after Build.   If anyone non-EAP wants to check out the recent changes to D3D12 with the new Windows 10 build 10074, compatible headers are already out there.  You can get the 10069 headers here (It'll also install build 10106 of the sdk layer and warp driver in system32):   Windows Software Development Kit (SDK) for Windows 10 Insider Preview Loading up a project based on the march preview headers, there are quite a few changes, but I managed to get a project written for the March Preview (10041) working with the new headers in about an hour (the fullscreen textured triangle project). What seems like a lot of changes are mostly just renaming enums to be more verbose/descriptive/accurate (e.g. D3D12_RESOURCE_USAGE_DESC is now D3D12_RESOURCE_STATES, D3D12_INPUT_PER_VERTEX_DATA becomes D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, etc.) One big change is to device creation, D3D12CreateDevice takes only 4 arguments now, dropping the driver type, device flags, and sdk version. I'm not sure of the recommended way of reproducing the dropped arguments' functionality, but I used the DirectX Control Panel to enable the debug layer and/or force a WARP device when I needed them. The most time consuming change, and I don't know if this is limited to just the 10069 headers, or if they're just in a different header now, is that a lot of the helper functions appear to have been removed. Not only are the CD3D12_ structs that initialize default values removed (e.g. CD3D12_HEAP_PROPERTIES), but also the helper functions that were part of the API structs such as D3D12_CPU_DESCRIPTOR_HANDLE::MakeOffsetted, and D3D12_ROOT_PARAMETER::InitAsDescriptorTable and all the rest. Basically all the API structs are now pure C structs with no member functions. 
You can at least discover the defaults those missing functions set by checking the 10030 D3D12.h that was installed by the Visual Studio Tools for the March Preview of win10. One other nice change appears to be a bugfix for fullscreen swapchains, at least from my point of view. I think it's on the previous page of this thread, but I posted an issue where IDXGISwapChain3::GetCurrentBackBufferIndex would give the wrong active back buffer if the swapchain was in fullscreen (always reported n-1, where n is the actual active back buffer). Well it looks like that is fixed in this build, fullscreen reports the same correct index as windowed mode, no need to track the active back buffer manually anymore.
  13. Thanks for expanding on your earlier post.  I'd agree that increasing the maximum frame latency so you can use a huge number of backbuffers isn't a very practical workaround, just a waste of vram.   I'm kind of surprised to hear you can't use full-screen with a flip model swapchain in D3D12?  I understand if you've heard that directly from microsoft that you can't really confirm it though.  You can definitely use a flip model swapchain in fullscreen with D3D11, you just have to take some extra care with the transition to and from fullscreen.  My understanding is that even with a non-flip-model swapchain (e.g. DXGI_SWAP_EFFECT_DISCARD), that for fullscreen presents DXGI will "flip" the backbuffer to the front buffer, rather than blit it, providing they have the same resolution. There's more info here. Essentially the choice of swap effect only concerns whether the backbuffer is blitted or flipped to the DWM in windowed mode. I'm trying to read between the lines as to how that has changed with D3D12. I know technologies like g-sync and free-sync currently rely on fullscreen to work at all. So if the docs recommend you use the flip model, and you can't use fullscreen with the flip swap effect (although warp lets you)... then fullscreen output doesn't use the same swapchain? I guess I'll take the hint and wait to see what news comes out of the Build event at the end of the month.
  14. Well now I'm curious what the internal news might be. I know flip_sequential will limit you to the desktop vsync due to the nature of how the flip model works. I've been using it for desktop apps for a couple of years with D3D11, since it was part of DXGI 1.2 in Windows 8. So far as I always understood it, you are limited to vsync because that's the display interval used by the DWM, and with the flip model the backbuffer is given directly to the DWM rather than copied to an intermediate surface for the DWM to use. So when using the flip model you have to wait at Present for the DWM to hand the backbuffer back after the next desktop refresh. I think (?) increasing the number of backbuffers in the swapchain and using the DXGI_PRESENT_DO_NOT_WAIT present flag will stop helping after 4 buffers, because by default the driver is set to queue a maximum of 3. You can maybe override it with IDXGIDevice1::SetMaximumFrameLatency; I never pursued that myself though. I do know that changing the output to fullscreen allows flip_sequential to present with a 0 sync interval. It makes sense, since in fullscreen no desktop composition is required, so it avoids the DWM's desktop refresh rate. From testing with the WARP driver it still (currently) works the same in D3D12 as it did with D3D11. In that D3D12-Fullscreen2 project I attached to an earlier post, if you use a WARP driver and change the present interval to 0, then use Alt-F5 to start the app and Alt-Enter to go fullscreen, the diagnostic log shows the framerate goes way above vsync. You might need to comment out the draw command though, since WARP is so slow. I'm a bit puzzled why you would need to present faster than vsync in windowed mode though. Like I said, I've been using the flip model since Windows 8 came out, and it's never come up. Perhaps to roughly measure performance?
I've used fullscreen frametimes occasionally for that, but typically you need more information so I'd use queries or a frame capture from VSGA/NSight/GPA. If it's an output latency thing, then 16ms is hard to notice unless you're using VR perhaps, in which case you wouldn't be using any desktop composition, right? I have a terrible feeling I've been overlooking something for the last couple of years!  
  15. That certainly sounds like a driver problem, that's rough. On the off-chance your issue is related (the driver is crashing when your render code attempts to access a backbuffer that is currently being presented), I just updated my sample with a workaround for the fullscreen swapchain not rotating correctly with WARP (and possibly all drivers). It's attached to this post. Basically IDXGISwapChain3::GetCurrentBackBufferIndex gives a misleading value when the swap chain is fullscreen. I decided to take a look at what happens with a 3-buffer swapchain, and while windowed mode gives the expected 0, 1, 2, 0, 1, 2.. buffer index order, fullscreen mode gives 0, 1, 0, 1.. as the index order. GetCurrentBackBufferIndex skips the last buffer index in fullscreen, essentially, but internally the swapchain is rotating through all of the buffers correctly. Luckily you can just ignore that function's output and track the active index yourself. After a call to IDXGISwapChain::ResizeBuffers it appears to always use buffer index 0 for the next frame presented. I'm using a 2-buffer swapchain normally, so each frame you can bitwise XOR against 1U to flip the index ready for the next frame, and reset your index to 0 whenever you resize the swapchain buffers.