SergioJdelos

  1. You could use a compute shader for it; you just need a UAV for that texture so you can read and modify it.
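
     A minimal sketch of that, assuming the texture was created with D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS and a root signature whose parameter 0 is a table with one UAV range (all names here are illustrative, not from the original post):

        // Hedged sketch: read-modify-write a texture from a compute shader via a UAV.
        D3D12_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
        uavDesc.Format        = DXGI_FORMAT_R8G8B8A8_UNORM;   // must be a UAV-compatible format
        uavDesc.ViewDimension = D3D12_UAV_DIMENSION_TEXTURE2D;
        device->CreateUnorderedAccessView(texture, nullptr, &uavDesc, cpuHandleInShaderVisibleHeap);

        cmdList->SetComputeRootSignature(rootSig);
        cmdList->SetPipelineState(computePso);
        ID3D12DescriptorHeap* heaps[] = { srvUavHeap };
        cmdList->SetDescriptorHeaps(1, heaps);
        cmdList->SetComputeRootDescriptorTable(0, gpuHandleOfThatUav);
        cmdList->Dispatch((width + 7) / 8, (height + 7) / 8, 1); // assumes 8x8 thread groups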
  2. You should have one more buffer than the number of frames that you queue, since one of the buffers is the one being presented. So for a 3-buffer swap chain, you should queue up to 2 frames. That said, queuing 2 frames should be more than enough (you want to keep your GPU busy but also keep latency reasonable). As for that error: if you request the right buffer each frame, it should not happen.
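
     A minimal sketch of that throttle, assuming a 3-buffer swap chain and one fence signaled per frame (variable names are illustrative):

        // Hedged sketch: cap the CPU at 2 queued frames for a 3-buffer swap chain.
        const UINT64 kMaxQueuedFrames = 2;
        UINT64 frameIndex = 0;

        void EndFrame(ID3D12CommandQueue* queue, IDXGISwapChain3* swapChain,
                      ID3D12Fence* fence, HANDLE fenceEvent)
        {
            swapChain->Present(1, 0);
            ++frameIndex;
            queue->Signal(fence, frameIndex);

            // Block only when we would get more than 2 frames ahead of the GPU.
            if (frameIndex > kMaxQueuedFrames &&
                fence->GetCompletedValue() < frameIndex - kMaxQueuedFrames)
            {
                fence->SetEventOnCompletion(frameIndex - kMaxQueuedFrames, fenceEvent);
                WaitForSingleObject(fenceEvent, INFINITE);
            }
        }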
  3. SergioJdelos

    [D3D12] Multiple command queues

      https://developer.nvidia.com/sites/default/files/akamai/gameworks/blog/GDC16/GDC16_gthomas_adunn_Practical_DX12.pdf

      Anyway, I am probably confusing myself with older, non-public documentation... A depth-only copy will hit a slow path if depth and stencil are interleaved (like D24S8 on NVIDIA). This has nothing to do with the DMA engine. On AMD hardware there is no D24 (depth is always 32-bit) and stencil and depth are never interleaved, so you will never hit this case.
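
     For reference, a minimal sketch of a depth-only copy between depth-stencil textures, assuming matching formats/dimensions and resources already in the COPY_SOURCE/COPY_DEST states (D3D12CalcSubresource comes from d3dx12.h):

        // Hedged sketch: copy only the depth plane (plane slice 0) of a
        // depth-stencil texture; the stencil plane would be plane slice 1.
        D3D12_TEXTURE_COPY_LOCATION src = {};
        src.pResource        = srcDepthTex;
        src.Type             = D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX;
        src.SubresourceIndex = D3D12CalcSubresource(0, 0, /*PlaneSlice*/ 0, mipLevels, arraySize);

        D3D12_TEXTURE_COPY_LOCATION dst = src; // same subresource layout assumed
        dst.pResource = dstDepthTex;

        cmdList->CopyTextureRegion(&dst, 0, 0, 0, &src, nullptr);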
  4. SergioJdelos

    what good are cores?

     Actually, quad channel makes sense on the HEDT platform. A mainstream i7 has up to 4 cores, but HEDT has up to 10 (and up to 22 in Xeons). So if a chip has 8 cores, it needs 4 memory channels to have the same bandwidth per core as a mainstream i7 using dual channel. So a brand new 6950X (10 cores, HEDT) has less bandwidth per core than a 6700K (4 cores, mainstream).
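
     A quick worked example, assuming DDR4-2400 at roughly 19.2 GB/s per channel (assumed figures, not from the original post):

        dual channel: 2 x 19.2 GB/s = 38.4 GB/s / 4 cores  = 9.6 GB/s per core (6700K)
        quad channel: 4 x 19.2 GB/s = 76.8 GB/s / 10 cores = 7.7 GB/s per core (6950X)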
  5. SergioJdelos

    MultiRenderTarget vs. UAV r/w in PS

     The problem is the order. Rasterizer Ordered Views fix that, but for normal UAVs the order of execution is not the same as the order of primitives/draw calls. That said, using ROVs instead of UAVs will be slower, and most likely slower than using MRTs. Edit: if this is a fullscreen pass, then it should be fine, but if more than one triangle renders to a pixel, then you can't replace MRTs with UAVs.
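
     ROVs are also an optional feature, so a minimal sketch of checking for support before taking that path (standard CheckFeatureSupport usage):

        // Hedged sketch: Rasterizer Ordered Views are optional hardware, so
        // query support before relying on ordered UAV access in a pixel shader.
        D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
        if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                                  &options, sizeof(options)))
            && options.ROVsSupported)
        {
            // Safe to use RasterizerOrderedTexture2D<> etc. in HLSL.
        }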
  6. Have you profiled? While I agree a kernel call takes more than a regular function call, it really depends on what work is occurring when.

     Yes, I did. And batching command lists (in three groups per frame: one for scene rendering before Scaleform, one for GUI with Scaleform, and one after Scaleform for more GUI and patching of resource states) gave the best results on all the platforms I tested (GeForce 760, GeForce 980M, Radeon 290X, Intel Haswell HD 4600, Broadwell HD 6200 and Skylake HD 520). I said two before because I forgot about Scaleform.

     The first batch contains all the GBuffer generation lists, shadow lists, the lighting pass (it's a light pre-pass renderer that we use on all of our platforms, including GL ES 2 on iPad), SSAO/HBAO+, the material pass (because it's LPP), deferred decals, transparent meshes/FX and post-processing (FXAA/CMAA, DoF, etc.). Each command list is generated in a separate job running in parallel.

     Then we render the GUI using Scaleform (which submits its own command lists).

     And the third call renders the rest of the stuff and updates some resource states (like going from render target to present for the back buffer).

     We have a two-frame latency with the GPU (I keep everything alive for two frames), and I can tell you from GPUView that I keep the GPU busy without bubbles, but this is after the TH2 update. Pre-TH2 there were issues with that (though they were related to DXGI and the presentation system, not DX12 itself).
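
     A minimal sketch of that submission pattern, assuming the per-batch lists were recorded and closed by parallel jobs (names are illustrative, and RenderScaleformGui stands in for Scaleform's own submission):

        // Hedged sketch: a few batched submits per frame instead of many small ones.
        ID3D12CommandList* sceneBatch[] = { gbufferCL, shadowCL, lightingCL, postCL };
        queue->ExecuteCommandLists(_countof(sceneBatch), sceneBatch);

        RenderScaleformGui(queue); // hypothetical: Scaleform submits its own lists

        ID3D12CommandList* tailBatch[] = { lateGuiCL, presentTransitionCL };
        queue->ExecuteCommandLists(_countof(tailBatch), tailBatch);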
  7. I read the opposite... that building command lists is expensive while submitting them is relatively cheap. In addition, if you buffer your entire frame's command lists before submitting them, it will take more memory and potentially (most likely) lead to an idle GPU waiting for work to do. Edit: although making lists with 100 draws does reduce the number of submits.

     Submitting to the command queue is not cheap! It is a kernel call, after all. On the other side, calls on a command list are simple user-mode calls, and those are very cheap.

     Having a whole frame buffered adds latency, and yes, it takes memory, but it is the best way to keep the GPU fully busy.

     In any case, you should profile! You can always use GPUView to see how the CPU and GPU are working (among other tools), but my recommendation is to have a full frame of latency so the GPU can work at its full potential (unless perhaps you are working on VR, where low latency is much more important).

     Edit: There is one issue with Intel drivers where resetting command lists/allocators is very expensive (and by very expensive I mean it can take several ms each; I have cases of almost 5 ms to reset where it takes less than 1 ms to build the same command list). This doesn't affect all Intel GPUs (only Haswell/Broadwell) and it doesn't affect NVIDIA/AMD that much, but I do reset all my command lists in jobs running in parallel with other work.
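
     A minimal sketch of hiding that reset cost in parallel jobs; std::async is only a stand-in for the engine's job system here (an assumption), and frameSlots is an illustrative name:

        // Hedged sketch: reset allocator/list pairs off the critical path, since
        // Reset can cost several ms on some Intel (Haswell/Broadwell) drivers.
        #include <future>
        #include <vector>

        std::vector<std::future<void>> resetJobs;
        for (auto& slot : frameSlots) // one allocator + list per recording job
        {
            resetJobs.push_back(std::async(std::launch::async, [&slot]
            {
                slot.allocator->Reset();                   // reclaims memory, keeps it
                slot.list->Reset(slot.allocator, nullptr); // ready for re-recording
            }));
        }
        for (auto& job : resetJobs) job.wait(); // join before recording begins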
  8. Also remember that it is suggested that you have (IIRC) 100 draws per command list (12 for bundles). Visibility can change on a per-frame basis, so reusing command lists isn't really optimal.

     100 draw calls (and 12 for bundles) is the recommended minimum; the idea is to avoid small command lists with 1 or 2 commands. That said, in the game I work on we have some very small command lists (like post-processing) with maybe 20 commands (if all the steps are on), and that is OK. Just be sure to dispatch all your command lists in as few calls to the queue as possible (I think we have 2 per frame: one early with a lot of scene stuff and another later with the latest scene steps, post-processing, GUI, etc.). Dispatching to the queue is the expensive operation, but if you batch your command lists it's not that bad.

     Building command lists in DX12 is cheap, very cheap. So don't worry about it, but if you have 20,000 elements to draw, you may want to split that into several lists (to get more parallelism), like 10 command lists of 2,000 elements each; just don't go to 2,000 command lists with 10 elements each ;)

     Also, the command allocator never shrinks on reset. So no, it doesn't free anything, but you can reuse the memory. Think of it as a vector that can grow, but every time you call clear it just sets the size to 0 instead of releasing memory. So my recommendation is to stick 1:1 with command lists and command allocators, and try not to build command lists too big (500 draw calls is fine, 20,000 is not) nor too small, unless you need to (think again of post-processing and the like).
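
     A minimal sketch of that 1:1 pairing and the vector-like reuse behavior (assuming one pair per in-flight recording job; names are illustrative):

        // Hedged sketch: one allocator per command list. Allocator Reset() keeps
        // its memory for reuse (like std::vector::clear() keeping capacity), so
        // each pair reaches a steady-state size after a few frames.
        struct RecordingSlot
        {
            ID3D12CommandAllocator*    allocator;
            ID3D12GraphicsCommandList* list;
        };

        void Record(RecordingSlot& slot)
        {
            slot.allocator->Reset();                  // GPU must be done with it!
            slot.list->Reset(slot.allocator, nullptr);
            // ... record a few hundred draws, not 20,000 and not 2 ...
            slot.list->Close();
        }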
  9. Yes, because the create methods (for SRV, UAV, CBV) just write to that memory, and that memory is read by the GPU when it executes the draw calls; it is not copied into the command list the way RTV and DSV descriptors are. Again, the UAV is a special case for clears, but not for normal rendering when you just bind the table. But if you want to copy descriptors (SRV, UAV, CBV), the source must be a non-shader-visible (so CPU read/write) heap. The target can be (and most likely will be) a shader-visible (CPU write-only) heap.
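
     A minimal sketch of that copy direction (staging a descriptor in a CPU-only heap, then copying it into the shader-visible one; handle names are illustrative):

        // Hedged sketch: descriptors are staged in a non-shader-visible heap
        // (CPU read/write) and copied into the shader-visible heap the GPU uses.
        device->CreateShaderResourceView(texture, &srvDesc, stagingHeapCpuHandle);

        device->CopyDescriptorsSimple(
            1,                                        // number of descriptors
            shaderVisibleHeapCpuHandle,               // destination (shader-visible)
            stagingHeapCpuHandle,                     // source (non-shader-visible)
            D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);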
  10. Hi. In order to clear a UAV, you need two descriptors from two different heaps:
      - the CPU descriptor, from a non-shader-visible heap
      - the GPU descriptor, from a shader-visible heap

      As far as I know, this is because some GPUs require the descriptor in GPU-visible memory, while others use the data that is copied into the command list when you call the method (clear, in this case), so the API forces you to have both so it will run on any GPU. As a side note, CPU handles from non-shader-visible heaps are copied into the command list when you call a method that uses them (in this case ClearUnorderedAccessViewFloat), and for that the descriptor needs to be in CPU-readable memory. I hope it helps!
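
      A minimal sketch of the clear call with both handles, assuming the same UAV was created in both heaps (handle names are illustrative):

         // Hedged sketch: ClearUnorderedAccessViewFloat takes BOTH the GPU handle
         // from the shader-visible heap and the CPU handle from a CPU-only heap.
         const FLOAT clearValue[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
         cmdList->ClearUnorderedAccessViewFloat(
             gpuHandleInShaderVisibleHeap,   // GPU descriptor
             cpuHandleInNonVisibleHeap,      // CPU descriptor
             uavResource,
             clearValue,
             0, nullptr);                    // no rects: clear the whole resource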
  11. SergioJdelos

    D3D12 Best Practices

        Ideally, you should set one per command list (between reset and close); switching inside a command list is expensive. For most cases a single descriptor heap for SRV/CBV/UAV should be enough for an entire game, but you could have more.

        For example, since CreateShaderResourceView and the like, and CopyDescriptors, are ID3D12Device methods and not ID3D12GraphicsCommandList methods, they happen on the CPU timeline (I mean they are not queued to the GPU), which means you can't modify a heap while the GPU could be using it. For that reason you could, for example, have one descriptor heap per frame in flight (2 or 3) and always modify the one being used for the frame you are building, but never touch the ones queued on the GPU. Eventually you need to copy/create those descriptors into the other heaps too, so you have to replicate and queue those updates per heap.

        Currently, for our current game, we are not using more than one descriptor heap per type. But the game is not that complex (it's a Diablo kind of game), and I know that I have plenty of space available for memory management, so I build a bunch of allocators with different patterns on different blocks of the heap. Considering that the limit for a tier 1 heap is 1M descriptors, I don't think I will need anything much more aggressive anytime soon. Also, we bind textures and the like in a single table per material, so all our textures are packed close together, but we have many copies of the same descriptor.

        If you have very big worlds you may want to pack much more, and maybe go bindless. But if you are doing bindless and the like, you may want to pack your descriptors closer, and for that, using a very long latency for slot reuse may not be a good idea, so the idea of multiple heaps with queued per-heap updates makes more sense. Also, if you plan to use async compute (and that compute may need to access the descriptors), you may want to put those descriptors in another descriptor heap.

        Hodgman covered the basics, but for unlocked frame rate I recommend you look at this: However, I have to admit that I found some issues with that, so for now I am using DXGI_PRESENT_RESTART as the Present flag only for benchmarking; DXGI_PRESENT_RESTART is not recommended in general, as it can cause some issues.
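
        A minimal sketch of the once-per-list heap binding (set between reset and close, never switched mid-list; heap names are illustrative):

           // Hedged sketch: bind the SRV/CBV/UAV and sampler heaps once per
           // command list; switching heaps inside a list can flush on some hardware.
           cmdList->Reset(allocator, pso);
           ID3D12DescriptorHeap* heaps[] = { srvCbvUavHeap, samplerHeap };
           cmdList->SetDescriptorHeaps(_countof(heaps), heaps);
           // ... record the whole list against these heaps ...
           cmdList->Close();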
  12. SergioJdelos

    Two directX12 questions

     No. The NO_OVERWRITE flag is a promise you make to the API that you won't modify the data until the GPU is done with it. In other words, protecting the data is in your hands, not the API's nor the driver's. You can use a fence or a query, or you can rely on the fact that the GPU can't be more than 3 frames behind (not the best idea, but it usually works), so you can reuse the buffers when you know the GPU is done with them. For constant buffers in D3D11 you should stick to MAP_DISCARD, unless you are targeting 11.1 and Windows 8.1 only, as Matias said. Creating a constant buffer per object per frame is not a good idea.
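
     A minimal D3D11 sketch of that MAP_DISCARD pattern, assuming the buffer was created with D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE (names are illustrative):

        // Hedged sketch: discard-map a dynamic constant buffer on each update;
        // the driver renames the buffer, so no fencing against the GPU is needed.
        D3D11_MAPPED_SUBRESOURCE mapped = {};
        if (SUCCEEDED(context->Map(constantBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
        {
            memcpy(mapped.pData, &cbData, sizeof(cbData));
            context->Unmap(constantBuffer, 0);
        }
        context->VSSetConstantBuffers(0, 1, &constantBuffer);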
  13. SergioJdelos

    D3D12 Root Signatures for different shaders

     Yes, it is the same. But as I said, it doesn't scale/work on Resource Binding tier 1 hardware.
  14. SergioJdelos

    D3D12 Root Signatures for different shaders

     Yes, you can do that. In the root signature creation, create the table for 2 textures, even if only one is used in one of those cases. In the long run, I think it's just better to go bindless, but if you want to run on Haswell or Fermi that is not an option (you need resource binding tier 2 to be flexible enough).
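
     A minimal sketch of a root signature whose table always exposes 2 SRVs (t0-t1), even when one shader variant only reads t0 (using the d3dx12.h helpers):

        // Hedged sketch: one table of 2 SRVs shared by both shader variants;
        // the variant that samples only t0 simply ignores t1.
        CD3DX12_DESCRIPTOR_RANGE range;
        range.Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 2, 0); // 2 descriptors, base register t0

        CD3DX12_ROOT_PARAMETER param;
        param.InitAsDescriptorTable(1, &range, D3D12_SHADER_VISIBILITY_PIXEL);

        CD3DX12_ROOT_SIGNATURE_DESC desc;
        desc.Init(1, &param, 0, nullptr,
                  D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT);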
  15. As a note, the minimum alignment for constant buffer views in D3D12 is 256 bytes, so at minimum your CB instance data will occupy 256 bytes. Even a single float4 (16 bytes) for an instance color will require that you allocate 256 bytes.

      We have been porting our game to D3D12 from D3D11, and we use map/discard for CBs in our D3D11 renderer. So the easiest way for me was to build a custom constant buffer that mimics that behavior. For that, we use a stack allocator that works on pages that we allocate with a TagAllocator (look at the Naughty Dog GDC presentation on porting The Last of Us to PS4 for how they handle things).

      So, for each map/discard, we move the stack pointer; if the stack has no space for the block (or it's the first time), we request a new page from the tag allocator (pages are 512 KB in size). Also, each time we request a buffer, the data size is aligned to 256 (so if I request less than 256 bytes it's rounded to 256; if it's 257 it's rounded to 512, and so on). The allocator itself just sets the label or tag on the pages at allocation; then all pages with that tag can be released together (I have an auto tag generated from the frame index, so since I queue 3 frames, when I start a new frame I can release the pages tagged "current frame - 3"). Note that we use the tag allocators for many things, not just constant data (it is good for streaming too).

      The tag allocator returns a handle (which contains the CPU pointer for writing plus the GPU address for binding), so it's easy to use for both. For things that are static we don't use this approach (since I ported that code before having the tag allocator). Instead, I just create a buffer per static CB with space for 3 frames (since I may change the data; static for us just means it will not change all the time). A good thing is that you can do pointer math with the GPU and CPU addresses, so if you know where you are writing on the CPU and the start address of the buffer, you can infer the address you need to bind on the GPU.

      Going back to alignment, I found it very useful to pack constant data on the fly for DX12. So even when in DX11 I have several CBs per mesh (instance for VS, instance for PS, material for VS, material for PS, skinning, instance data, etc.), in DX12 it's just one per stage, and I just stack the data (and #ifdef out the unused data in the shader). It's good because we need fewer buffers and we use the space better (less transfer too; keep in mind that transfers are always in blocks: at the CPU level it's a 64-byte cache line, at least on x86, but the GPU may need more, and I think the 256-byte restriction exists because of that). It also helps to keep the same root signature for all meshes of a render pass, no matter how many CBVs each needed in DX11. Optional CBVs would mean holes in the root signature, or being forced to use tables for constant data (which is less convenient, at least for dynamic data).

      Btw, for very small data you can use root constants too, but only small stuff like indices and the like; don't try to push a matrix there.
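
      A minimal sketch of that 256-byte-aligned stack suballocation over a mapped upload page, returning the paired CPU pointer and GPU address (the page/tag machinery is elided, and all names are illustrative, not the author's actual code):

         // Hedged sketch: suballocate per-draw constant data from a mapped upload
         // page. CBV addresses must be 256-byte aligned, so sizes round up to 256.
         #include <cassert>
         #include <cstddef>
         #include <cstdint>
         #include <d3d12.h>

         struct UploadPage
         {
             uint8_t*                  mappedCpuBase; // from ID3D12Resource::Map
             D3D12_GPU_VIRTUAL_ADDRESS gpuBase;       // from GetGPUVirtualAddress
             size_t                    size;          // e.g. 512 KB
             size_t                    offset;        // current stack top
         };

         struct CbAlloc
         {
             void*                     cpu; // write the constants here
             D3D12_GPU_VIRTUAL_ADDRESS gpu; // bind as a root CBV
         };

         CbAlloc AllocConstants(UploadPage& page, size_t dataSize)
         {
             const size_t aligned = (dataSize + 255) & ~size_t(255); // 16 -> 256, 257 -> 512
             assert(page.offset + aligned <= page.size); // otherwise request a new page

             // The same offset works on both sides: CPU/GPU pointer math.
             CbAlloc out = { page.mappedCpuBase + page.offset, page.gpuBase + page.offset };
             page.offset += aligned;
             return out;
         }

      After memcpy-ing the constants to cpu, the gpu address can be bound directly with SetGraphicsRootConstantBufferView.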