ZachBethel

  1. That's what I thought. The solution I went with is to keep a map from image descriptor hash to resource allocation info. It cut the cost down by 3x. Thanks!
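    For reference, a rough sketch of the caching approach (assuming the C++ D3D12 headers; the hash helper and cache struct are hypothetical):

        #include <d3d12.h>
        #include <cstdint>
        #include <unordered_map>

        // FNV-1a over the raw bytes; fine as long as descs are zero-initialized
        // so that padding bytes are deterministic.
        uint64_t HashResourceDesc(const D3D12_RESOURCE_DESC& desc)
        {
            const uint8_t* p = reinterpret_cast<const uint8_t*>(&desc);
            uint64_t h = 1469598103934665603ull;
            for (size_t i = 0; i < sizeof(desc); ++i) { h ^= p[i]; h *= 1099511628211ull; }
            return h;
        }

        struct AllocInfoCache {
            std::unordered_map<uint64_t, D3D12_RESOURCE_ALLOCATION_INFO> cache;

            D3D12_RESOURCE_ALLOCATION_INFO Get(ID3D12Device* device,
                                               const D3D12_RESOURCE_DESC& desc)
            {
                const uint64_t key = HashResourceDesc(desc);
                auto it = cache.find(key);
                if (it != cache.end())
                    return it->second;  // skip the expensive driver call
                D3D12_RESOURCE_ALLOCATION_INFO info =
                    device->GetResourceAllocationInfo(0, 1, &desc);
                cache.emplace(key, info);
                return info;
            }
        };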
  2. Hey, I'm working on a placed resource system, and I need a way to determine the size and alignment of image resources before placing them on the heap. This is used for transient resources within a frame. The appropriate method on ID3D12Device is GetResourceAllocationInfo. Unfortunately, this method is quite slow and eats up a pretty significant chunk of time, way more than I would expect for something that just returns a size and an alignment (I'm passing a single D3D12_RESOURCE_DESC each time). Is there a way I can conservatively estimate this value for certain texture resources (i.e. ones without mip chains or something)? Thanks.
  3. Yeah, I was mistaken. I believe I was confused by the fact that most hardware typically has a single hardware graphics queue. At any rate, the issue was that I was querying GetCompletedValue on my fences at a time when I thought all previous work had completed. That was not the case.
  4. Hey all, I'm trying to debug some async compute synchronization issues. I've found that if I force all command lists to run through a single ID3D12CommandQueue instance, everything is fine. However, if I create two DIRECT queue instances, and feed my "compute" work into the second direct queue, I start seeing the issues again. I'm not fencing between the two queues at all because they are both direct. According to the docs, it seems as though command lists should serialize properly between the two instances of the direct queue because they are of the same queue class. Another note is that I am feeding command lists to the queues on an async thread, but it's the same thread for both queues, so the work should be serialized properly. Anything obvious I might be missing here? Thanks!
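    For context, explicit cross-queue synchronization between two queue instances would look roughly like this (a minimal sketch; names are made up and error handling is omitted):

        #include <d3d12.h>

        // Separate ID3D12CommandQueue objects give no implicit ordering
        // guarantees, so a fence marks where listB may start.
        void SubmitWithDependency(ID3D12CommandQueue* queueA, ID3D12CommandQueue* queueB,
                                  ID3D12CommandList* listA, ID3D12CommandList* listB,
                                  ID3D12Fence* fence, UINT64& fenceValue)
        {
            queueA->ExecuteCommandLists(1, &listA);
            queueA->Signal(fence, ++fenceValue);  // completion point for listA

            queueB->Wait(fence, fenceValue);      // GPU-side wait; the CPU does not block
            queueB->ExecuteCommandLists(1, &listB);
        }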
  5. I'm reading through the Microsoft docs trying to understand how to properly use aliasing barriers to alias placed resources. The docs say:

    "Applications must activate a resource with an aliasing barrier on a command list, by passing the resource in D3D12_RESOURCE_ALIASING_BARRIER::pResourceAfter. pResourceBefore can be left NULL during an activation. All resources that share physical memory with the activated resource now become inactive or somewhat inactive, which includes overlapping placed and reserved resources."

    If I understand correctly, it's not necessary to actually provide pResourceBefore for each overlapping resource, as the driver will iterate the pages and invalidate overlapping resources for you. This is the Simple Model. The Advanced Model is different:

    "The active/inactive abstraction can be ignored and the following lower-level rules must be honored, instead: An aliasing barrier must be between two different GPU resource accesses of the same physical memory, as long as those accesses are within the same ExecuteCommandLists call. The first rendering operation to certain types of aliased resource must still be an initialization, just like the Simple Model."

    I'm confused because it looks like, in the Advanced Model, I'm expected to declare pResourceBefore for every resource that overlaps pResourceAfter (so I'd have to submit N aliasing barriers). Is the idea that the driver can either do the tracking for you (NULL pResourceBefore) or you can do it yourself (specify every overlapping resource)? That seems to be the tradeoff. It would be nice if I could just "activate" resources with AliasingBarrier(NULL, activatingResource) and not worry about tracking deactivations. Am I understanding the docs correctly? Thanks.
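    For concreteness, the Simple Model activation I'd like to rely on is just (a minimal sketch; names are placeholders):

        #include <d3d12.h>

        // Passing NULL as pResourceBefore activates 'activating' and lets the
        // runtime treat everything sharing its physical memory as deactivated.
        void ActivateAliasedResource(ID3D12GraphicsCommandList* cmdList, ID3D12Resource* activating)
        {
            D3D12_RESOURCE_BARRIER barrier = {};
            barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_ALIASING;
            barrier.Aliasing.pResourceBefore = nullptr;
            barrier.Aliasing.pResourceAfter = activating;
            cmdList->ResourceBarrier(1, &barrier);
        }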
  6. A retrospective on the Infinity project

    Please tell me you're not going to be the only engineer on this. That just hasn't been working out for you, as brilliant as you are. ;)
  7. Is it valid behavior to map a region of a readback resource while simultaneously writing to a disjoint region via the GPU? I've got a profiler subsystem with a single readback buffer that is N times the size of my query heap, for N frames. The SDK debug layer warns that the subresource is mapped while the GPU is writing to it.
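    For concreteness, the mapping pattern looks roughly like this (a sketch; names are mine):

        #include <d3d12.h>

        // Map only the current frame's slice of an N-frame readback buffer;
        // frameIndex selects a region the GPU is not currently writing.
        void* MapFrameRegion(ID3D12Resource* readback, UINT frameIndex, SIZE_T sliceSize)
        {
            D3D12_RANGE readRange = { frameIndex * sliceSize, (frameIndex + 1) * sliceSize };
            void* ptr = nullptr;
            readback->Map(0, &readRange, &ptr);  // Map returns the start of the whole subresource;
                                                 // the range only declares what will be read
            return static_cast<char*>(ptr) + readRange.Begin;
        }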
  8. [D3D12] Debug validation weirdness

    It turns out it's the UAV barrier that's barking at me, and it only seems to happen if I issue a UAV barrier on a command list without first transitioning the resource to the UnorderedAccess state in that same command list, which seems wrong.
  9. I've got a scenario where I'm building a command list that involves using UAVs. The UAV is transitioned to the UnorderedAccess state in a prior command list, like so:

    Command List A:
      Transition NonPixelShaderResource -> UnorderedAccess

    Command List B:
      UAV barrier
      ClearUnorderedAccessViewUint
      Dispatch
      more UAV barriers

    Direct Queue: (A, B)

    When I try to queue a UAV barrier on the later command list, I get this error spew:

    D3D12 ERROR: ID3D12CommandList::ClearUnorderedAccessViewUint: Resource state (0x0) of resource (0x00000242CA4635A0:'Histogram') (subresource: 0) is invalid for use as a unordered access view. Expected State Bits: 0x8, Actual State: 0x0, Missing State: 0x8. [ EXECUTION ERROR #538: INVALID_SUBRESOURCE_STATE]

    D3D12 ERROR: ID3D12GraphicsCommandList::ResourceBarrier: Before state (0x8) of resource (0x00000242CA4635A0:'Histogram') (subresource: 0) specified transition barrier does not match with the state (0x0) specified in the previous call to ResourceBarrier [ RESOURCE_MANIPULATION ERROR #527: RESOURCE_BARRIER_BEFORE_AFTER_MISMATCH]

    Is the debug layer just over-validating, or is there actually an issue here? For one, the error doesn't really make sense: if I remove the UAV barrier call the errors stop, but my resource is definitely not in the common state (0x0). I get this error even when I create the resource in the UnorderedAccess state. Besides, how can the debug layer know I haven't transitioned the resource properly before I call ExecuteCommandLists? A prior command list could do the transition. Has anyone encountered this issue before?
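    For reference, the pattern boils down to roughly this (a repro sketch; helper names are made up):

        #include <d3d12.h>

        void RecordListA(ID3D12GraphicsCommandList* a, ID3D12Resource* histogram)
        {
            D3D12_RESOURCE_BARRIER t = {};
            t.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
            t.Transition.pResource = histogram;
            t.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
            t.Transition.StateBefore = D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE;
            t.Transition.StateAfter = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
            a->ResourceBarrier(1, &t);
        }

        void RecordListB(ID3D12GraphicsCommandList* b, ID3D12Resource* histogram)
        {
            // The debug layer flags this barrier: list B, recorded in isolation,
            // never saw the transition that list A recorded.
            D3D12_RESOURCE_BARRIER uav = {};
            uav.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
            uav.UAV.pResource = histogram;
            b->ResourceBarrier(1, &uav);
            // ... ClearUnorderedAccessViewUint / Dispatch follow here
        }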
  10. A thing I'm struggling with right now is how to handle mapping of resources across multiple command lists, i.e.:

        // Thread 1
        ConstData cd;
        cd.data = 1;
        ptr = constBuffer.Map();
        memcpy(ptr, &cd, sizeof(cd));
        constBuffer.Unmap();
        CL1->draw(obj1);

        cd.data = 2;
        ptr = constBuffer.Map();
        memcpy(ptr, &cd, sizeof(cd));
        constBuffer.Unmap();
        CL1->draw(obj2);

        // Thread 2
        CL2->draw(obj3); // uses the const buffer being written by thread 1

        // Submission thread: CL1 -> CL2

    The second write overwrites constants the first draw may still need, and thread 2 reads the buffer while thread 1 writes it. One approach I've seen is to cache changes to the resource in a command-list-local cache, and then update the contents of the buffers when the command lists are serialized on the submission thread.
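    A rough sketch of that approach, with entirely hypothetical types (the draw records an offset chosen at record time, and the actual memcpy into the mapped upload buffer is deferred to the submission thread):

        #include <cstdint>
        #include <cstring>
        #include <vector>

        struct PendingWrite {
            size_t dstOffset;           // offset into the upload buffer, chosen at record time
            std::vector<uint8_t> data;  // CPU-side copy of the constants
        };

        struct CommandListContext {
            std::vector<PendingWrite> pendingWrites;

            // Recording threads call this instead of Map/memcpy/Unmap; each draw
            // gets its own region, so later writes can't stomp earlier ones.
            size_t StageConstants(const void* src, size_t size, size_t dstOffset)
            {
                PendingWrite w;
                w.dstOffset = dstOffset;
                w.data.assign(static_cast<const uint8_t*>(src),
                              static_cast<const uint8_t*>(src) + size);
                pendingWrites.push_back(std::move(w));
                return dstOffset; // the draw is recorded against this offset
            }
        };

        // Submission thread, with the upload buffer persistently mapped: replay
        // all cached writes just before ExecuteCommandLists.
        void FlushPendingWrites(uint8_t* mappedUploadBuffer, CommandListContext& ctx)
        {
            for (const PendingWrite& w : ctx.pendingWrites)
                std::memcpy(mappedUploadBuffer + w.dstOffset, w.data.data(), w.data.size());
            ctx.pendingWrites.clear();
        }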
  11. [D3D12] Command list submission ordering

    Great, that's what I expected. I feel like a lot of documents assume you know that and gloss over it. On that note, when building a task graph, it seems wise to statically bake out your high-level render passes, have your queue submission thread batch up all the command lists in each pass (e.g. wait until all your Z-prepass lists come in), and then submit in dependency-sorted order (don't submit the G-buffer pass until the Z-prepass group has been submitted).
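    A minimal sketch of that submission loop (the Pass type is hypothetical, and the busy-wait stands in for real job-system synchronization):

        #include <d3d12.h>
        #include <vector>

        struct Pass {
            std::vector<ID3D12CommandList*> lists;  // filled in by worker threads
            size_t expectedCount;                   // known when the frame graph is baked
            bool Complete() const { return lists.size() == expectedCount; }
        };

        // Passes are stored in dependency-sorted order; each pass is submitted
        // only after all its lists have arrived and every earlier pass has gone out.
        void SubmitFrame(ID3D12CommandQueue* queue, std::vector<Pass>& sortedPasses)
        {
            for (Pass& pass : sortedPasses) {
                while (!pass.Complete()) { /* block on the job system here */ }
                queue->ExecuteCommandLists(static_cast<UINT>(pass.lists.size()),
                                           pass.lists.data());
            }
        }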
  12. When you submit command lists to a command queue, what ordering guarantees / expectations do you have? According to MSDN:

    "GPU work submission: To execute work on the GPU, an app must explicitly submit a command list to a command queue associated with the Direct3D device. A direct command list can be submitted for execution multiple times, but the app is responsible for ensuring that the direct command list has finished executing on the GPU before submitting it again. Bundles have no concurrent-use restrictions and can be executed multiple times in multiple command lists, but bundles cannot be directly submitted to a command queue for execution. Any thread may submit a command list to any command queue at any time, and the runtime will automatically serialize submission of the command list in the command queue while preserving the submission order."

    That last sentence is where I'm confused. Is it the case that if I build N command lists and call ExecuteCommandLists(...) with an array of those N command lists, they are processed in order? That much seems to be true. The fuzzier part for me is how transition barriers and fences play into the submission order.

    Say I have a Z-prepass and a shadow pass, and then some G-buffer pass. Assuming I transition-barrier everything correctly, am I expected to submit my Z-prepass / shadow pass command lists before the G-buffer command lists? That would basically mean I have to schedule my submission thread to wait for all the precursor work to come in from the job system before it can submit. This is what I expect I have to do, but it's pretty unclear to me. It doesn't help that none of the samples online actually do a job-system based multithreaded demo :) I would love an elaboration on how the driver actually schedules the command list work. Thanks!
  13. Hey all, I'm reading up on render passes, which seem like a powerful concept for cluing the driver into the exact shape of your render pipeline. In the spec they explain that the driver can react to ordering constraints to insert transition barriers. Something I'm confused about is how render passes relate to command lists and command list submission.

    For one, render passes form a DAG. Do I still have to submit command lists in dependency-sorted order with respect to render passes? I would expect so, but I wasn't able to find any specific details on that.

    Secondly, what's the granularity of a command list relative to a render pass? Can a command list span several render passes (through multiple begin/end blocks)? Can a render pass be composed of several command lists (and if so, does each one inherit the begin/end state from the previously submitted list)?

    If you understand the details of this I would love to get your input. Thanks!
  14. DX12 Descriptor binding: DX12 and Vulkan

    I've been thinking more about this, and I've come to realize some things.

    I did some investigation into how some real workloads handle the root signature. The vast majority of what I saw have a structure similar to this (DX12-style binding slots):

    For bucketed scene draws:
    0: some push constants
    1: per-draw constant buffer
    2: per-pass constant buffer
    3: per-material constant buffer
    4: a list of SRVs

    For various post-processing jobs:
    0+: constant buffers, a simple table of UAVs, a simple table of SRVs

    I didn't find any use cases where different regions of the same descriptor table were used for different things; for the most part it seems a simple list of SRVs / UAVs is enough. I also realized that Vulkan has the strong notion of a render pass, and that UAVs could be factored into render passes as outputs (which are then transitioned to SRVs). To me, it seems like having constant buffer binding slots, a way to bind a list of SRVs to the draw call, and a way to bind a list of UAVs to a render pass is enough to support most scenarios. A sketch of that root signature layout is below.

    With regards to list allocation, it seems like descriptor layouts are going to be bounded by the application. Like you said, Witek902, you could just create a free-list pool for descriptors and orphan them into a recycle queue on update. Static descriptor sets just get allocated once and held. For DX12, you could model the same technique by allocating fixed-size pieces out of a descriptor heap, or by using some sort of buddy allocator. With the descriptor heap approach it becomes a bit weirder, because the ideal use case seems to be keeping the same heap bound for the whole frame.

    I also read in GPU Fast Paths that using dynamic constant buffers eats up 4 registers of the USER-DATA memory devoted to the pipeline layout. Apparently using a push constant to offset into a big table is more performant (I'm not sure how portable this is to platforms like mobile).

    Anyway, just some thoughts.
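    Here's that layout as an actual root signature (a rough sketch assuming the C++ D3D12 headers; the function name and register assignments are just for illustration):

        #include <d3d12.h>

        HRESULT CreateSceneRootSignature(ID3D12Device* device, ID3D12RootSignature** outSig)
        {
            // Slot 4: a flat table of SRVs (t0..t7), matching the "list of SRVs" idea.
            D3D12_DESCRIPTOR_RANGE srvRange = {};
            srvRange.RangeType = D3D12_DESCRIPTOR_RANGE_TYPE_SRV;
            srvRange.NumDescriptors = 8;
            srvRange.BaseShaderRegister = 0;
            srvRange.OffsetInDescriptorsFromTableStart = D3D12_DESCRIPTOR_RANGE_OFFSET_APPEND;

            D3D12_ROOT_PARAMETER params[5] = {};

            // Slot 0: push constants (root constants) at b0.
            params[0].ParameterType = D3D12_ROOT_PARAMETER_TYPE_32BIT_CONSTANTS;
            params[0].Constants.ShaderRegister = 0;
            params[0].Constants.Num32BitValues = 4;
            params[0].ShaderVisibility = D3D12_SHADER_VISIBILITY_ALL;

            // Slots 1..3: per-draw, per-pass, per-material CBVs as root descriptors
            // (b1..b3), so each can be versioned in the command list per draw.
            for (UINT i = 1; i <= 3; ++i)
            {
                params[i].ParameterType = D3D12_ROOT_PARAMETER_TYPE_CBV;
                params[i].Descriptor.ShaderRegister = i;
                params[i].ShaderVisibility = D3D12_SHADER_VISIBILITY_ALL;
            }

            params[4].ParameterType = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE;
            params[4].DescriptorTable.NumDescriptorRanges = 1;
            params[4].DescriptorTable.pDescriptorRanges = &srvRange;
            params[4].ShaderVisibility = D3D12_SHADER_VISIBILITY_PIXEL;

            D3D12_ROOT_SIGNATURE_DESC desc = {};
            desc.NumParameters = 5;
            desc.pParameters = params;
            desc.Flags = D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT;

            ID3DBlob* blob = nullptr;
            HRESULT hr = D3D12SerializeRootSignature(&desc, D3D_ROOT_SIGNATURE_VERSION_1,
                                                     &blob, nullptr);
            if (FAILED(hr)) return hr;
            hr = device->CreateRootSignature(0, blob->GetBufferPointer(),
                                             blob->GetBufferSize(), IID_PPV_ARGS(outSig));
            blob->Release();
            return hr;
        }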
  15. I've been reading up on how the resource binding methods work in Vulkan and DX12. I'm trying to figure out how to best design an API that abstracts the two with respect to binding descriptors to the pipeline. Naturally, the two APIs are similar, but I'm finding that they treat descriptor binding differently in subtle ways.

    Disclaimer: skip to the question at the bottom if you have a deep understanding of this already.

    Explanation: In DirectX, you define a "root signature". It can have push constants, inline descriptor binding points, or descriptor table binding points. It also defines static samplers on the signature itself. A descriptor table is a contiguous block of descriptors within a descriptor heap; binding a table means pointing the pipeline at the first descriptor of that block in the heap. Tables can hold either UAV/SRV/CBV descriptors or SAMPLER descriptors; you cannot mix the two within a single heap, and therefore within a single table. Descriptor tables are also organized into ranges, where each range defines one or more descriptors of a SINGLE type.

    [Image: root signature example and descriptor heap indirection]

    In Vulkan, you define a "pipeline layout". It can have push constants and "descriptor set" binding points; you cannot define inline descriptor binding points. Each descriptor set layout defines its own static (immutable) samplers. A descriptor set is a first-class object in Vulkan, and it likewise has one or more ranges of a SINGLE descriptor type.

    [Image: descriptor sets]

    Now, an interesting pattern I'm seeing is that the two APIs provide descriptor versioning functionality for completely different things. In DirectX, you can version descriptors implicitly within the command list using the root descriptor bindings; this lets you do things like specify a custom offset for a constant buffer view. In Vulkan, they provide an explicit UNIFORM_DYNAMIC descriptor type that allows you to version an offset in the command list.

    [Image: dynamic uniform descriptors]

    Question: okay, so I'm really just looking for advice on how to organize binding points for an API that wraps these two models. My current tentative approach is to provide an API for creating buffers and images, and then explicit UAV/SRV/CBV/RTV/DSV views into those objects. The resulting view is an opaque, typeless handle on the frontend that can map to descriptors in DirectX 12 or to some staging resource in Vulkan for building descriptor sets.

    I think I want to provide an explicit "ResourceSet" object that defines 1..N ranges of views, similar to how both the descriptor set and descriptor table models work. I expect I'd make sampler binding a separate API that does its own thing on the two backends. I would really like to treat these ResourceSet objects like constant buffers, except that I'm writing view handles into them.

    I need to figure out how to handle versioning of updates to these descriptor sets. In the simplest case, I treat them as fully static. This maps well to both DX12 and Vulkan, because I can simply allocate space in a descriptor heap or create a descriptor set, write the descriptors to it, and be done. Handling dynamic updates becomes complicated in both APIs, and this is the crux of where I'm struggling right now.

    Both APIs let me push constants, so that's not really a problem. However, DirectX allows you to version descriptors directly in the command list, while Vulkan allows dynamic offsets into buffers; it seems like this is chiefly for CBVs. So if I want something like a descriptor set with 3 CBVs plus dynamic offsets, in DirectX I have to explicitly version the entire table by allocating new space in the heap and spilling descriptors to it. On the other hand, since Vulkan doesn't really have the notion of root descriptors, I'd have to create multiple descriptor set objects and version those if I want to bind a single dynamic UAV.

    Either way, it seems like the preferred model is to build static descriptor sets but provide some fast path for constant buffers, and that's the direction I think I'm going to head in. A rough sketch of the ResourceSet idea is below.

    Anyway, does this sound like a sane approach? Have you guys found better ways to abstract these two binding models?

    Side question: how do you version descriptor sets in Vulkan? Do you just pool descriptor sets for the frame and spill when updates occur?

    Thanks!
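    For concreteness, a bare-bones sketch of the ResourceSet idea (all types here are hypothetical):

        #include <cstdint>
        #include <vector>

        enum class ViewType { CBV, SRV, UAV };

        // Opaque, typeless view handle on the frontend; maps to a descriptor
        // (DX12) or to staging data for descriptor set writes (Vulkan).
        struct ResourceView { uint64_t handle; };

        struct ResourceSetRange {
            ViewType type;    // one descriptor type per range, as in both APIs
            uint32_t count;
        };

        struct ResourceSetDesc {
            std::vector<ResourceSetRange> ranges;  // 1..N ranges
        };

        class ResourceSet {
        public:
            // Static sets are written once. For dynamic sets, an update marks the
            // set dirty; the backend spills to fresh descriptor heap space (DX12)
            // or a newly pooled VkDescriptorSet (Vulkan) on the next bind.
            void SetView(uint32_t rangeIndex, uint32_t slot, ResourceView view);
        };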