ajmiles last won the day on October 7

ajmiles had the most liked content!

Community Reputation

3358 Excellent

  1. Back buffer DXGI_FORMAT

    Don't confuse a 10-bit sRGB/Rec.709 output with 10-bit HDR/Rec.2020. The fact that you're now using a 10-bit swap chain, and potentially getting a 10-bit output, is completely separate from whether you're using HDR or Rec.2020.
  2. Back buffer DXGI_FORMAT

    Certainly no harm in using a 10-bit back buffer. A lot of displays these days will understand a 10-bit signal. Some will be able to reproduce all 2^30 colours; others might use the extra information to 'dither' the extra levels with their 8-bit capability. Just remember to be gamma correct when you write to it and apply the sRGB encoding yourself, as there's no _SRGB version of the 10-bit format.
  3. Bindings of constant buffers are separate from, and unrelated to, the currently bound Vertex Shader. When you bind a constant buffer you are binding it to the "pipeline", not to the currently set Vertex Shader. For that reason, you are free to set a constant buffer once and then repeatedly update its contents and change Vertex Shader without needing to rebind the constant buffer.
  4. Am I being dumb, or wouldn't DXGI_FORMAT_B8G8R8A8 come through in the correct order into the shader without messing around with swizzles?
  5. Forward and Deferred Buffers

    I hope you mean 16 bits/channel here rather than 16 bytes/texel! Using R32G32B32A32_FLOAT for the HDR buffers would be practically unheard of. R16G16B16A16_FLOAT is fine for almost everything you would want to do, and R11G11B10_FLOAT is commonly used even in most AAA titles.
  6. R&D Tile based particle rendering

    GroupMemoryBarrierWithGroupSync() is the one you'll see 99% of the time. It blocks all threads in the thread group from executing any further until all threads have finished accessing LDS and hit that instruction. It's essentially a cross-wave synchronisation point. A typical pattern would be: (1) all threads write to LDS, (2) GroupMemoryBarrierWithGroupSync(), (3) all threads read LDS.
  7. R&D Tile based particle rendering

    It seems reasonable to imagine it /could/ change in the future (or may even have already changed?), but that's up to the hardware; it's not a D3D/HLSL thing. No idea what the behaviour is on other IHVs. I don't think the size of a thread group has much bearing on being able to hide memory latency per se. Obviously you want to make sure the hardware has enough waves to be able to switch between them; 4 per SIMD (16 per CU) is a reasonable target to aim for. But whether that's 16 waves from 16 different thread groups or 16 waves from a single thread group doesn't matter too much, so long as enough of them are making forward progress and executing instructions while others wait on memory. You don't want to be writing a 1024-thread thread group (16 waves on AMD) where one wave takes on the lion's share of the work while the other 15 sit around stalled on barriers. That's not going to help you hide latency at all. There's nothing inherently wrong with larger thread groups; you just need to be aware of how the waves get scheduled and ensure that you don't have too many waves sitting around doing nothing.
  8. R&D Tile based particle rendering

    It won't be, no. The hardware doesn't seem to launch a replacement thread group either until all waves in the thread group have retired, so I tend to steer clear of Thread Groups > 1 wave unless I'm using LDS.
  9. Sorry, I was referring specifically to the older Intel Haswell parts only supporting a 32-bit virtual address per resource; it was actually 31 bits according to this table: https://en.wikipedia.org/wiki/Feature_levels_in_Direct3D#Support_matrix. If on the Haswell GPUs you can only have a 2GB resource, or even 2GB per process as that table suggests, then it can be a bit restrictive. One way to do Texture Streaming on D3D12 is to reserve virtual address space for the entire texture's mip chain, including higher resolution mips that aren't yet streamed in because the object isn't close enough to need them. You can then use the Reserved Resources (aka Tiled Resources) API to commit physical memory to higher resolution mips as and when they get loaded. However, if you're always going to allocate the full amount of virtual memory, then you can run out of it very quickly, even if you're careful to only use 1GB of physical memory at any one time.
  10. The GPU's address space is likely quite a bit smaller than the full 64 bits. It may be as small as 32 bits on some older Intel parts, limiting you to resources no larger than 4GB in size, but 38 bits on newer ones, I think. AMD's parts have generally been around the 40-bit or 48-bit range, and I think NVIDIA is 40 too. You can query for MaxGPUVirtualAddressBitsPerResource from this structure: https://msdn.microsoft.com/en-us/library/windows/desktop/dn770364(v=vs.85).aspx I've come pretty close with my sparse voxel work to hitting the 40-bit limit (an 8K * 8K * 8K R8 texture is 512GB, which needs 39 bits), but generally only the 32-bit limit is going to pose anyone any problems.
  11. R&D Tile based particle rendering

    And so did what I wrote answer your question?
  12. R&D Tile based particle rendering

    What do you mean by the term 'invocation'? To me an invocation is a single thread of execution, meaning a 1080p Quad would "invoke" the pixel shader ~2M times. A single thread of course will be run on a single CU for its lifetime.
  13. R&D Tile based particle rendering

    Using LDS doesn't limit a Compute Shader to a single CU, no. The requirement is that a single thread group runs all its waves on a single CU so that they all have access to the same bit of LDS. A 256-thread thread group is 4 waves, and would typically be scheduled to have one wave per SIMD. A 1024-thread thread group would have 4 waves running on each SIMD (all on the same CU). You're only wasting / not using CUs if you have fewer thread groups than you have CUs. Since even the biggest AMD parts only have 64 CUs, you'd have to be running at an extremely low resolution to be issuing fewer than (64 * 1024) threads :).
  14. If you can provide a simplest-possible repro I can test it against the latest compiler and file it with the team responsible for fixing it if it's still broken. If providing the shader is IP sensitive, I can provide you with an email address to send it to directly. (I work for Microsoft)
  15. I don't think we're quite on the same page regarding your current approach then. If your triangles already cover only the necessary pixels that need to be rendered, what purpose does the mask serve and when do you read it?
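
Several of the figures quoted in the posts above are simple arithmetic and can be sanity-checked on the CPU. The following is a minimal C++ sketch, not code from any of the threads. The function names are hypothetical, the wave size of 64 is an assumption matching the AMD figures discussed above, and the encoding function assumes the standard sRGB transfer curve applied by hand before writing to a 10-bit UNORM target (in a real renderer this would more likely live in the shader).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Smallest number of address bits able to span `bytes` bytes.
// Hypothetical helper; assumes bytes >= 1.
constexpr unsigned addressBitsFor(std::uint64_t bytes) {
    unsigned bits = 0;
    while (bits < 64 && (std::uint64_t{1} << bits) < bytes) {
        ++bits;
    }
    return bits;
}

// Number of waves one thread group occupies, rounding up.
// waveSize = 64 is an assumption (AMD-style waves as discussed above).
constexpr unsigned wavesPerGroup(unsigned threads, unsigned waveSize = 64) {
    return (threads + waveSize - 1) / waveSize;
}

// Encode a linear value in [0, 1] with the standard sRGB curve and
// quantise to 10 bits, as you would need to do manually when the
// render target has no _SRGB variant.
inline std::uint32_t encodeSrgbUnorm10(float linear) {
    const float c = std::min(std::max(linear, 0.0f), 1.0f);
    const float s = (c <= 0.0031308f)
        ? 12.92f * c
        : 1.055f * std::pow(c, 1.0f / 2.4f) - 0.055f;
    return static_cast<std::uint32_t>(std::lround(s * 1023.0f));
}

// Compile-time checks against the numbers quoted in the posts.
// An 8K * 8K * 8K texture at one byte per texel is 2^39 bytes (512GB).
static_assert(addressBitsFor(std::uint64_t{8192} * 8192 * 8192) == 39,
              "8K^3 R8 texture should need 39 address bits");
static_assert(wavesPerGroup(256) == 4, "256 threads = 4 waves of 64");
static_assert(wavesPerGroup(1024) == 16, "1024 threads = 16 waves of 64");
```

For example, addressBitsFor(std::uint64_t{1} << 32) gives 32, which is the 4GB per-resource limit mentioned for the older Intel parts, and encodeSrgbUnorm10 maps 0.0 to 0 and 1.0 to 1023.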