About ajmiles

  1. float4x2 and float2x4 are every bit as much a 'matrix' as float4x4 for the purposes of packing. /Zpr (Row Major Packing) affects float2x4/float4x2 as well, and an array of 64 of them will take 4096 bytes instead of 2048 (or vice versa) depending on whether that flag is set. This shader, when compiled with /Zpr, has a 2048-byte constant buffer and reads float4s:

         cbuffer B { float2x4 stuff[64]; }
         float4 main(uint i : I) : SV_TARGET { return stuff[i][0] + stuff[i][1]; }
  2. float2x4 stuff[64]; is not 2048 bytes. With the default column-major packing it's 4096 bytes, as each 'register' in a constant buffer is padded to float4. No such padding occurs with a StructuredBuffer, so perhaps you're copying a 2048-byte structured buffer into the first half of a constant buffer that the compiler expects to be 4096 bytes? You probably wanted float4x2 stuff[64] instead. Can you show me your cbuffer layout so we can be sure that's the problem? I expect either you've only got half the data in the right place or it has been transposed between float2x4 and float4x2.
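To make the register arithmetic in the two posts above concrete, here is a rough C++ sketch. It assumes a simplified packing model in which every row (row-major) or column (column-major) of every array element occupies one whole 16-byte register; the function names are invented for this illustration and are not part of any D3D or HLSL API.

```cpp
#include <cassert>
#include <cstdint>

// Size of one constant-buffer register (a float4).
constexpr std::uint32_t kRegisterBytes = 16;

// Simplified model (an assumption of this sketch): each row of a row-major
// matrix, or each column of a column-major matrix, takes one whole register.
// rows/cols follow the HLSL type name: floatRxC has R rows and C columns.
std::uint32_t cbufferMatrixArrayBytes(std::uint32_t rows, std::uint32_t cols,
                                      std::uint32_t count, bool rowMajor) {
    assert(rows >= 1 && rows <= 4 && cols >= 1 && cols <= 4);
    const std::uint32_t registersPerMatrix = rowMajor ? rows : cols;
    return registersPerMatrix * count * kRegisterBytes;
}

// A StructuredBuffer stores the floats tightly, with no per-register padding.
std::uint32_t structuredMatrixArrayBytes(std::uint32_t rows, std::uint32_t cols,
                                         std::uint32_t count) {
    return rows * cols * static_cast<std::uint32_t>(sizeof(float)) * count;
}
```

Under this model, `cbufferMatrixArrayBytes(2, 4, 64, false)` gives 4096 while `structuredMatrixArrayBytes(2, 4, 64)` gives 2048, which is the mismatch described in the second post.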
  3. You can't measure GPU time using QueryPerformanceCounter. All you've done is measure how long it takes to issue the API calls, no?
  4. DX12 MSAA in DX12?

    ResolveSubresource and ResolveSubresourceRegion (new to DX12) still exist if you don't want to do your MSAA resolve manually. If your resolve operation is just an average of the N samples then using the Resolve API will be at least as fast as doing it yourself.
  5. DirectXMath conditional assignment

    XMVectorSelect is what you're looking for. It takes the masks output by functions such as XMVectorLessOrEqual, and each bit in the mask is used to select between A and B.
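The bitwise behaviour described above can be mimicked in plain C++ without DirectXMath. This is only a sketch of the idea: `Float4`, `Mask4`, `lessOrEqualMask`, `selectBits` and `minPerLane` are stand-in names made up for the example, not the library's types or functions. As in XMVectorSelect, a 0 bit in the control mask takes the bit from the first input and a 1 bit takes it from the second.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Float4 = std::array<float, 4>;
using Mask4 = std::array<std::uint32_t, 4>;

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Per-lane compare: all bits set where a <= b, all bits clear otherwise.
Mask4 lessOrEqualMask(const Float4& a, const Float4& b) {
    Mask4 mask{};
    for (std::size_t i = 0; i < 4; ++i) {
        mask[i] = (a[i] <= b[i]) ? 0xFFFFFFFFu : 0u;
    }
    return mask;
}

// Bitwise select: control bit 0 picks from a, control bit 1 picks from b.
Float4 selectBits(const Float4& a, const Float4& b, const Mask4& control) {
    Float4 out{};
    for (std::size_t i = 0; i < 4; ++i) {
        std::uint32_t ua = 0, ub = 0;
        std::memcpy(&ua, &a[i], sizeof(ua));
        std::memcpy(&ub, &b[i], sizeof(ub));
        const std::uint32_t bits = (ua & ~control[i]) | (ub & control[i]);
        std::memcpy(&out[i], &bits, sizeof(bits));
    }
    return out;
}

// Conditional assignment without branching: keep x where x <= limit,
// otherwise take limit.
Float4 minPerLane(const Float4& x, const Float4& limit) {
    return selectBits(limit, x, lessOrEqualMask(x, limit));
}

// Small self-check showing the intended usage.
void exampleUsage() {
    const Float4 r = minPerLane(Float4{1.0f, 5.0f, 3.0f, 9.0f},
                                Float4{4.0f, 4.0f, 4.0f, 4.0f});
    assert(r[1] == 4.0f && r[2] == 3.0f);
}
```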
  6. Back buffer DXGI_FORMAT

    Don't confuse a 10-bit SRGB/Rec709 output with 10-bit HDR/Rec2020. The fact that you're now using a 10-bit swap chain and potentially getting a 10-bit output is a completely separate and unrelated matter from whether you're using HDR or Rec.2020.
  7. Back buffer DXGI_FORMAT

    Certainly no harm in using a 10-bit back buffer. A lot of displays these days will understand a 10-bit signal. Some will have the ability to reproduce 2^30 colours, others might use the extra information to 'dither' the extra levels with their 8-bit capability. Just remember to be gamma correct when you write to it, as there's no _SRGB version of the 10-bit format.
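Since there is no _SRGB variant to do the conversion for you, "being gamma correct" here means applying the sRGB encoding curve yourself before the value is stored. The sketch below shows that step on the CPU in C++ purely for illustration (in practice it would live at the end of a pixel shader). The function names are hypothetical, and the packing assumes the R10G10B10A2 layout with red in the low 10 bits and alpha in the top 2.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Standard sRGB encoding curve (linear -> display-encoded), clamped to [0, 1].
float linearToSrgb(float linear) {
    assert(!std::isnan(linear));
    const float c = std::min(std::max(linear, 0.0f), 1.0f);
    return (c <= 0.0031308f) ? 12.92f * c
                             : 1.055f * std::pow(c, 1.0f / 2.4f) - 0.055f;
}

// Quantise a [0, 1] value to a 10-bit UNORM channel (0..1023).
std::uint32_t toUnorm10(float v) {
    const float c = std::min(std::max(v, 0.0f), 1.0f);
    return static_cast<std::uint32_t>(std::lround(c * 1023.0f));
}

// Encode linear RGB and pack it as R10G10B10A2 with alpha forced to 3 (opaque).
// Bit layout assumed: R = bits 0-9, G = bits 10-19, B = bits 20-29, A = bits 30-31.
std::uint32_t packLinearToR10G10B10A2(float r, float g, float b) {
    return toUnorm10(linearToSrgb(r)) |
           (toUnorm10(linearToSrgb(g)) << 10) |
           (toUnorm10(linearToSrgb(b)) << 20) |
           (3u << 30);
}
```

Writing the linear value straight into the buffer would skip `linearToSrgb` and leave mid-tones noticeably too dark on an sRGB display.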
  8. Bindings of constant buffers are separate and unrelated to the currently bound Vertex Shader. When you bind a constant buffer you are binding it to the "pipeline", not to the currently set Vertex Shader. For that reason, you are free to set a constant buffer once and then repeatedly update its contents and change Vertex Shader without needing to rebind the constant buffer.
  9. Am I being dumb, or wouldn't DXGI_FORMAT_B8G8R8A8 come through in the correct order into the shader without messing around with swizzles?
  10. Forward and Deferred Buffers

    I hope you mean 16 bits/channel here rather than 16 bytes/texel! Using R32G32B32A32_FLOAT for the HDR buffers would be practically unheard of. R16G16B16A16_FLOAT is fine for almost everything you would want to do, and R11G11B10_FLOAT is commonly used even in most AAA titles.
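For a sense of scale, the footprint of one render target in each of those three formats can be worked out directly. This is a small C++ sketch; the helper and constant names are made up for the example, and the 1920x1080 resolution used in the comments is only an illustrative choice.

```cpp
#include <cassert>
#include <cstdint>

// Bits per texel for the formats mentioned above.
constexpr std::uint32_t kBitsR32G32B32A32Float = 128;  // 16 bytes/texel
constexpr std::uint32_t kBitsR16G16B16A16Float = 64;   //  8 bytes/texel
constexpr std::uint32_t kBitsR11G11B10Float = 32;      //  4 bytes/texel

// Raw size of a single-sample, single-mip render target
// (ignores any driver alignment or padding).
std::uint64_t renderTargetBytes(std::uint32_t width, std::uint32_t height,
                                std::uint32_t bitsPerTexel) {
    assert(bitsPerTexel % 8 == 0);
    return static_cast<std::uint64_t>(width) * height * (bitsPerTexel / 8);
}

// At 1920x1080 (2,073,600 texels):
//   R32G32B32A32_FLOAT -> 33,177,600 bytes
//   R16G16B16A16_FLOAT -> 16,588,800 bytes
//   R11G11B10_FLOAT    ->  8,294,400 bytes
```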
  11. R&D Tile based particle rendering

    GroupMemoryBarrierWithGroupSync() is the one you'll see 99% of the time. It blocks all threads in the thread group from executing any further until all threads have finished accessing LDS and hit that instruction. It's essentially a cross-wave synchronisation point. A typical pattern would be: (1) all threads write to LDS, (2) GroupMemoryBarrierWithGroupSync(), (3) all threads read LDS.
  12. R&D Tile based particle rendering

    It seems reasonable to imagine it /could/ change in the future (or may even have already changed?), but that's up to the hardware; it's not a D3D/HLSL thing. No idea what the behaviour is on other IHVs. I don't think the size of a thread group has much bearing on being able to hide memory latency per se. Obviously you want to make sure the hardware has enough waves to be able to switch between them; 4 per SIMD (16 per CU) is a reasonable target to aim for. But whether that's 16 waves from 16 different thread groups or 16 waves from a single thread group doesn't matter too much, so long as enough of them are making forward progress and executing instructions while others wait on memory. You don't want to be writing a 1024-thread thread group (16 waves on AMD) where one wave takes on the lion's share of the work while the other 15 sit around stalled on barriers; that's not going to help you hide latency at all. There's nothing inherently wrong with larger thread groups, you just need to be aware of how the waves get scheduled and ensure that you don't have too many waves sitting around doing nothing.
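The wave counts quoted above follow from a simple round-up division. A minimal C++ sketch, assuming a 64-wide wave as on the AMD hardware discussed in the post (the function name is invented for the example):

```cpp
#include <cassert>
#include <cstdint>

// Number of waves the hardware needs to run one thread group,
// rounding up because a partially filled wave still occupies a wave slot.
std::uint32_t wavesPerThreadGroup(std::uint32_t threadsPerGroup,
                                  std::uint32_t waveSize) {
    assert(waveSize > 0);
    return (threadsPerGroup + waveSize - 1) / waveSize;
}

// wavesPerThreadGroup(1024, 64) is 16: one such group already supplies the
// 16 waves per CU suggested as a target, so it only hides latency if most of
// those waves are doing useful work rather than waiting on a barrier.
```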
  13. R&D Tile based particle rendering

    It won't be, no. The hardware doesn't seem to launch a replacement thread group either until all waves in the thread group have retired, so I tend to steer clear of Thread Groups > 1 wave unless I'm using LDS.
  14. Sorry, I was referring specifically to the older Intel Haswell parts only supporting a 32-bit virtual address per resource; it was actually 31 bits according to this table: https://en.wikipedia.org/wiki/Feature_levels_in_Direct3D#Support_matrix. If on the Haswell GPUs you can only have a 2GB resource, or even 2GB per process as that table suggests, then it can be a bit restrictive. One way to do Texture Streaming on D3D12 is to reserve virtual address space for the entire texture's mip chain, even including higher resolution mips that aren't yet streamed in due to proximity to the object. You can then use the Reserved Resources (aka Tiled Resources) API to commit physical memory to higher resolution mips as and when they get loaded. However, if you're always going to allocate the full amount of virtual memory, then you can run out of it very quickly, even if you're careful to only use 1GB of physical memory at any one time.
  15. The GPU's address space is likely quite a bit smaller than the full 64-bits. It may be as small as 32-bits on some older Intel parts, limiting you to resources no larger than 4GB in size, but 38 on newer ones I think. AMD's parts have generally been around the 40-bit or 48-bit range, and I think NVIDIA is 40 too. You can query for MaxGPUVirtualAddressBitsPerResource from this structure: https://msdn.microsoft.com/en-us/library/windows/desktop/dn770364(v=vs.85).aspx I've come pretty close with my sparse voxel work to hitting the 40-bit limit (an 8K * 8K * 8K R8 texture is 512GB / 39 bits), but generally only the 32-bit limit is going to pose anyone any problems.
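The "512GB / 39 bits" figure, and the 2GB limit for a 31-bit address, can be checked with a few lines of C++. This is a sketch only: the helper names are hypothetical, `maxBits` stands in for whatever value MaxGPUVirtualAddressBitsPerResource reports on a given device, and alignment and placement overheads are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Smallest number of address bits able to span `bytes` bytes.
std::uint32_t addressBitsNeeded(std::uint64_t bytes) {
    assert(bytes > 0);
    std::uint32_t bits = 0;
    while (bits < 64 && (std::uint64_t{1} << bits) < bytes) {
        ++bits;
    }
    return bits;
}

// True if a resource of `bytes` bytes fits within a per-resource limit of
// `maxBits` virtual address bits.
bool fitsInAddressSpace(std::uint64_t bytes, std::uint32_t maxBits) {
    return addressBitsNeeded(bytes) <= maxBits;
}

// An 8K * 8K * 8K texture at 1 byte per texel (R8):
// 8192^3 = 2^39 bytes = 512GB, which needs 39 bits.
constexpr std::uint64_t kVoxelTextureBytes =
    std::uint64_t{8192} * 8192 * 8192;
```

With these helpers, `fitsInAddressSpace(kVoxelTextureBytes, 40)` holds while `fitsInAddressSpace(kVoxelTextureBytes, 32)` does not.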