pcmaster

Member
  • Content count

    208

Community Reputation

987 Good

1 Follower

About pcmaster

  • Rank
    Member

Personal Information

  • Industry Role
    Programmer
  • Interests
    Programming
  1. DX12 DX12 Occlusion Queries

    Thank you for the article. It's very interesting. However, in the engine (and the types of games) I'm implementing DX12 for, we don't use instancing very much, and that approach doesn't lower the CPU cost: the higher level still has to prepare the data for each draw, which isn't negligible. Still, the approach sounds very good for many applications.
  2. DX12 DX12 Occlusion Queries

    One last thought. By reading back the query results on the CPU, I can decide on the CPU not to issue the draws at all. That saves the CPU time needed to prepare the constant buffers and descriptor tables, set other states, etc. With GPU predication, I'd still have to prepare each draw, possibly in vain. This is all only valid for a "traditional" renderer without fancy on-GPU command list building.
  3. @piluve, you'll also have memory overhead/waste in the first table, because UAV u0, for example, will always be expected at offset 16+16=32, no matter whether you used all 16+16 CBVs+SRVs or far fewer. Also, even if only one of the CBVs changed, you have to copy all the SRVs+UAVs into another 'version' of the table. I'm sure you're aware of this; that's the price we pay, and I think many people are doing it.
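
To put numbers on the fixed-layout cost described above, here is a minimal sketch. It assumes a table that always reserves 16 CBV slots, then 16 SRV slots, then the UAVs; the names `SlotInTable`, `WastedSlots` and `ByteOffset` are made up for illustration, and the increment size is whatever `ID3D12Device::GetDescriptorHandleIncrementSize` reports on the actual device.

```cpp
#include <cassert>
#include <cstdint>

// Assumed fixed layout: [16 CBVs][16 SRVs][UAVs...], as in the post above.
constexpr uint32_t kMaxCbvs = 16;
constexpr uint32_t kMaxSrvs = 16;

enum class ViewKind { Cbv, Srv, Uav };

// Slot index (in descriptors, not bytes) of register `reg` of the given kind.
constexpr uint32_t SlotInTable(ViewKind kind, uint32_t reg) {
    return kind == ViewKind::Cbv ? reg
         : kind == ViewKind::Srv ? kMaxCbvs + reg
                                 : kMaxCbvs + kMaxSrvs + reg;
}

// Slots reserved ahead of the UAVs that the shader never references.
uint32_t WastedSlots(uint32_t usedCbvs, uint32_t usedSrvs) {
    assert(usedCbvs <= kMaxCbvs && usedSrvs <= kMaxSrvs);
    return (kMaxCbvs - usedCbvs) + (kMaxSrvs - usedSrvs);
}

// Byte offset from the table start; incrementSize is device-specific.
constexpr uint64_t ByteOffset(uint32_t slot, uint32_t incrementSize) {
    return static_cast<uint64_t>(slot) * incrementSize;
}
```

With this layout, `SlotInTable(ViewKind::Uav, 0)` is 32 regardless of usage, and a shader using 2 CBVs and 1 SRV leaves 29 of the 32 leading slots unused in every version of the table.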
  4. DX12 DX12 Occlusion Queries

    I agree it's a horrible solution.
  5. DX12 DX12 Occlusion Queries

    Sure, but the time budget doesn't allow it right now.
  6. DX12 DX12 Occlusion Queries

    So the expected CPU-readback approach on PC is to insert a fence after ResolveQueryData and wait on it on the CPU. Btw, Hodgman, just out of curiosity, do you happen to know whether GCN already writes the query results for each of the 4/8 DBs, based on counters, into the backing memory at the bottom of the pipe? Or are some caches (DB?) involved?
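
The fence-after-resolve idea can be modelled without any D3D12 headers. The sketch below is only a model: `MockFence` stands in for `ID3D12Fence` (whose real `GetCompletedValue` it imitates), `QueryBatch` and `TryReadResult` are hypothetical names, and the `results` vector stands in for a mapped readback buffer. It polls instead of blocking, which is closer to the DX11 `GetData` behaviour than a hard wait.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for ID3D12Fence: the GPU raises `completed` when it passes a Signal.
struct MockFence {
    uint64_t completed = 0;
    uint64_t GetCompletedValue() const { return completed; }
};

// One batch of queries resolved together by a single ResolveQueryData.
struct QueryBatch {
    uint64_t fenceValue = 0;        // value signaled right after the resolve
    std::vector<uint64_t> results;  // stands in for the mapped readback buffer
};

// Non-blocking readback: returns false while the resolve may still be in flight,
// otherwise writes the sample count of query `index` to `out` and returns true.
bool TryReadResult(const MockFence& fence, const QueryBatch& batch,
                   std::size_t index, uint64_t& out) {
    assert(index < batch.results.size());
    if (fence.GetCompletedValue() < batch.fenceValue) {
        return false;  // "not ready": do other work and poll again later
    }
    out = batch.results[index];
    return true;
}
```

One fence value per batch, rather than one per `EndQuery`, keeps the number of signals low; the trade-off is that no query in the batch is readable until the whole batch has been resolved.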
  7. Hi! I wonder if I can achieve the same (not quite optimal) CPU readback of occlusion queries as with DX11:

        u64 result = 0;
        HRESULT hr = deviceCtx11->GetData(id3d11Query, &result, sizeof(u64), D3D11_ASYNC_GETDATA_DONOTFLUSH);
        return (S_OK == hr) ? "ready" : "not ready";

    This happens on the CPU. I'm able to see whether it's ready or not, and do other stuff if it isn't. In DX12, ResolveQueryData obviously happens on the GPU. If I put a fence after ResolveQueryData, I can be sure it has copied the results into my buffer. However, I wonder if there's any way other than inserting a fence after each EndQuery to see whether the individual queries have already finished. That sounds bad, and I guess the fence might do some flushing. I first want to implement what the other platforms in our engine do, before changing all of them to some more sensible batched occlusion query model. Thanks for any remarks.
  8. Thank you MJP for pointing that out. Our UAVs are in a table, fortunately.
  9. It does work!

        D3D12_UNORDERED_ACCESS_VIEW_DESC dummyUavDesc = {};
        dummyUavDesc.ViewDimension = D3D12_UAV_DIMENSION_TEXTURE3D;
        dummyUavDesc.Texture3D.FirstWSlice = 0;
        dummyUavDesc.Texture3D.MipSlice = 0;
        dummyUavDesc.Texture3D.WSize = 2048;
        dummyUavDesc.Format = DXGI_FORMAT_R8G8B8A8_SNORM;
        pD3D12Device->CreateUnorderedAccessView(nullptr, nullptr, &dummyUavDesc, cpuHandle);

    CreateUnorderedAccessView writes all zeroes to the memory designated by cpuHandle. CopyDescriptors() copies the zeroes correctly to the contiguous GPU-visible descriptor table, and the GPU recognises this. All cool. Thank you SoL!
  10. Hi, SoL! I was looking exactly for this, is it actually documented anywhere? I'm on an unnamed architecture where I could do a memset... as I say it didn't seem very legit. I'm just trying what you propose.
  11. Hello! I can see that when there's a write to UAVs in CS or PS and I bind a null ID3D11UnorderedAccessView into a used UAV slot, the GPU won't hang and the writes are silently dropped. I hope I'm not dreaming. With DX12, I can't seem to emulate this. I reckon it's impossible. The shader just reads the descriptor of the UAV (from a register/offset based on the root signature layout) and does an "image_store" at some offset from the base address. If it's unmapped, bang, we're dead. I tried zeroing out that GPU-visible UAV's range in the table, with the same result; such an all-zero UAV descriptor doesn't seem very legit anyway. I assume that's expected. Am I right? How does DX11 survive this? Does it silently patch the shader or what? Thanks, .P
  12. Lemme give you an example: A block compressed format, such as BC5, which only has R+G, will by default return B=0 and A=1 (I hope). You can force 1 or 0 into any channel, you can shuffle channels around, you can use some of them 1-4 times and you can drop any of them or all of them (replacing them by 0 or 1). I can't think of too many uses for this though.
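
The channel shuffling described above can be modelled in a few lines. This is only a CPU-side illustration: `Source`, `Mapping`, `Pick` and `ApplyMapping` are hypothetical names, and the post itself doesn't name the API (it resembles the per-component "from memory component N / force 0 / force 1" choices that D3D12 exposes through `Shader4ComponentMapping`).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Where each output channel comes from: a stored channel, or a forced constant.
enum class Source : uint8_t { R, G, B, A, Zero, One };

using Texel   = std::array<float, 4>;   // what the texture fetch produced (RGBA)
using Mapping = std::array<Source, 4>;  // one source per output channel (RGBA)

inline float Pick(const Texel& t, Source s) {
    switch (s) {
        case Source::R:    return t[0];
        case Source::G:    return t[1];
        case Source::B:    return t[2];
        case Source::A:    return t[3];
        case Source::Zero: return 0.0f;
        case Source::One:  return 1.0f;
    }
    assert(false && "unknown Source value");
    return 0.0f;
}

// Builds the value the shader would see for one fetched texel.
inline Texel ApplyMapping(const Texel& t, const Mapping& m) {
    Texel out = {};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = Pick(t, m[i]);
    }
    return out;
}
```

For example, the mapping `{R, R, R, G}` uses one channel three times, and `{One, Zero, Zero, One}` drops all stored channels and returns constants.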
  13. Then my recommendation remains - always use shader reflection (ID3D11ShaderReflection) to get the offsets of individual members, you can assume almost nothing.
  14. MJP, I understand why the 24 byte struct in array becomes 32 bytes, but why not its final element?
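
For reference, the arithmetic behind the 24-versus-32-byte observation can be sketched as below. It assumes the usual HLSL constant-buffer rule that every array element starts on a 16-byte boundary while the last element keeps its natural size; the helper names are hypothetical, and shader reflection remains the reliable source for real offsets.

```cpp
#include <cassert>
#include <cstddef>

// Rounds n up to the next multiple of 16 bytes (one float4 register).
constexpr std::size_t AlignUp16(std::size_t n) {
    return (n + 15) & ~static_cast<std::size_t>(15);
}

// Bytes from the start of the array to the end of its last element:
// all elements but the last occupy a full 16-byte-aligned stride.
constexpr std::size_t PackedArrayBytes(std::size_t elemSize, std::size_t count) {
    return count == 0 ? 0 : (count - 1) * AlignUp16(elemSize) + elemSize;
}

// Byte offset of element `index` relative to the start of the array.
inline std::size_t ElementOffset(std::size_t elemSize, std::size_t count,
                                 std::size_t index) {
    assert(index < count);
    return index * AlignUp16(elemSize);
}
```

Under that assumption, an array of four 24-byte structs has a stride of 32 bytes but ends at byte 120 rather than 128, which leaves 8 bytes of the last register available to whatever member follows.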
  15. Deferred device contexts

    Maybe think about a constant buffer pool (instead of fixed per-asset buffers): before each draw, you pick a "fresh" buffer of a suitable size and memcpy your transforms and whatnot into it. You promise the driver that you won't overwrite it while the GPU may still be reading it (NO_OVERWRITE), and you keep that promise either by not writing into a constant buffer until at least 2 frames after its last use (the simple solution: declare buffers fresh again once you're sure the GPU has moved on by enough frames) or by using fences (the complex solution) to be sure the GPU isn't reading it anymore. That way, the driver won't have to make hidden copies. This approach is multi-platform-friendly. The pool needs to be big enough, and/or you have to implement the complex solution with more precise fences. Edit: An even simpler solution is to stick to your per-asset cbuffer, but keep multiple copies (2-3 frames' worth) and cycle through them each frame, also with NO_OVERWRITE. Also, do read the NVIDIA article linked by Infinisearch; it's a good approach.
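
The "simple solution" bookkeeping might look roughly like this. It is a sketch of the frame-counting logic only, with no D3D calls: `ConstantBufferPool`, `Acquire` and `framesInFlight` are hypothetical names, and the returned index would select a real buffer that is then mapped with NO_OVERWRITE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Marks a buffer that has never been handed out.
constexpr uint64_t kNeverUsed = UINT64_MAX;

class ConstantBufferPool {
public:
    ConstantBufferPool(std::size_t bufferCount, uint64_t framesInFlight)
        : lastUsedFrame_(bufferCount, kNeverUsed), framesInFlight_(framesInFlight) {
        assert(framesInFlight > 0);
    }

    // Call once at the start of every frame with a monotonically increasing index.
    void BeginFrame(uint64_t frame) { frame_ = frame; }

    // Returns the index of a buffer the GPU is assumed to be done with,
    // or -1 if the pool is exhausted for this frame.
    int Acquire() {
        for (std::size_t i = 0; i < lastUsedFrame_.size(); ++i) {
            if (IsFresh(i)) {
                lastUsedFrame_[i] = frame_;
                return static_cast<int>(i);
            }
        }
        return -1;
    }

private:
    // Fresh = never used, or last used at least framesInFlight_ frames ago.
    bool IsFresh(std::size_t i) const {
        return lastUsedFrame_[i] == kNeverUsed ||
               frame_ >= lastUsedFrame_[i] + framesInFlight_;
    }

    std::vector<uint64_t> lastUsedFrame_;
    uint64_t framesInFlight_;
    uint64_t frame_ = 0;
};
```

A return value of -1 is the signal that the pool is too small for the chosen frame latency; the fence-based variant would replace `IsFresh` with a check against the GPU's completed fence value.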