About ajmiles

  1. Why do you need a group barrier at all? Can't you just use the optional 'originalValue' argument of InterlockedMin rather than re-reading the value you just InterlockedMin'ed into?
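A minimal sketch of that suggestion — the variable names are mine, not from the original thread, and initialisation of the LDS value is omitted:

```hlsl
groupshared uint gsMin; // hypothetical LDS minimum being computed

[numthreads(64, 1, 1)]
void CSMain(uint threadValue : SV_GroupIndex)
{
    uint previous;
    // The optional third argument receives the value gsMin held *before*
    // this thread's min was applied, so there's no need to barrier and
    // then re-read the destination afterwards.
    InterlockedMin(gsMin, threadValue, previous);
}
```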
  2. XNAMath now lives under the name "DirectXMath", also in the Windows SDK. By and large it should just be a drop-in replacement.
  3. Do you really want to still be using the DXSDK from June 2010? The Windows SDK (you have 16299.0) already has all the DirectX headers and libraries in it, so why not use that instead?
  4. float4x2 and float2x4 are every bit as much a 'matrix' as float4x4 for the purposes of packing. /Zpr (Row Major Packing) will affect float2x4/float4x2 and will cause them to take 4096 bytes instead of 2048 (and vice versa) depending on whether that flag is set. This shader, when compiled with /Zpr, is a 2048 byte constant buffer and reads float4's:

     cbuffer B
     {
         float2x4 stuff[64];
     };

     float4 main(uint i : I) : SV_TARGET
     {
         return stuff[i][0] + stuff[i][1];
     }
  5. float2x4 stuff[64]; is not 2048 bytes, it's 4096 bytes, as each 'register' in a constant buffer is padded out to a float4. No such padding occurs with a StructuredBuffer, so perhaps you're copying a 2048 byte structured buffer into the first half of a constant buffer that the compiler is expecting to be 4096 bytes? You probably wanted float4x2 stuff[64] instead. Can you show me your cbuffer layout so we can be sure that that's the problem? I expect either you've only got half the data in the right place, or it has been transposed between float2x4 and float4x2.
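To illustrate the packing difference being described, under default (column-major, no /Zpr) packing:

```hlsl
cbuffer CB
{
    // 4 columns, each a float2 padded out to a full float4 register:
    // 4 registers * 16 bytes * 64 elements = 4096 bytes.
    float2x4 a[64];

    // By contrast, float4x2 has 2 columns, each already a full float4:
    // 2 registers * 16 bytes * 64 elements = 2048 bytes.
    // float4x2 b[64];
};

// StructuredBuffers apply no register padding, so each float2x4 element
// is a tight 8 floats = 32 bytes; 64 elements = 2048 bytes.
StructuredBuffer<float2x4> Tight;
```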
  6. You can't measure GPU time using QueryPerformanceCounter. All you've done is measure how long it takes to issue the API calls, no?
  7. MSAA in DX12?

    ResolveSubresource and ResolveSubresourceRegion (new to DX12) still exist if you don't want to do your MSAA resolve manually. If your resolve operation is just an average of the N samples then using the Resolve API will be at least as fast as doing it yourself.
  8. DirectXMath conditional assignment

    XMVectorSelect is what you're looking for. It takes the mask output by comparison functions such as XMVectorLessOrEqual, and each bit in the mask is used to select between A and B.
  9. Back buffer DXGI_FORMAT

    Don't confuse a 10-bit SRGB/Rec709 output with 10-bit HDR/Rec2020. The fact that you're now using a 10-bit swap chain and potentially getting a 10-bit output is a completely separate and unrelated matter from whether you're using HDR or Rec.2020.
  10. Back buffer DXGI_FORMAT

    Certainly no harm in using a 10-bit back buffer. A lot of displays these days will understand a 10-bit signal. Some will have the ability to reproduce 2^30 colours, others might use the extra information to 'dither' the extra levels with their 8-bit capability. Just remember to be gamma correct when you write to it, as there's no _SRGB version of the 10-bit format.
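Since there's no _SRGB variant of the 10-bit format, the encoding has to be done by hand at the end of the shader. A sketch using the standard piecewise sRGB transfer function (the function name is mine):

```hlsl
float3 LinearToSRGB(float3 c)
{
    // Standard piecewise sRGB encode. With a 10-bit UNORM render target the
    // hardware won't apply this on write, so do it as the final shader step.
    float3 lo = c * 12.92;
    float3 hi = 1.055 * pow(c, 1.0 / 2.4) - 0.055;
    // step() selects lo per-component where c <= 0.0031308.
    return lerp(hi, lo, step(c, 0.0031308));
}
```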
  11. Bindings of constant buffers are separate and unrelated to the currently bound Vertex Shader. When you bind a constant buffer you are binding it to the "pipeline", not to the currently set Vertex Shader. For that reason, you are free to set a constant buffer once and then repeatedly update its contents and change Vertex Shader without needing to rebind the constant buffer.
  12. Am I being dumb, or wouldn't DXGI_FORMAT_B8G8R8A8 come through in the correct order into the shader without messing around with swizzles?
  13. Forward and Deferred Buffers

    I hope you mean 16-bits/channel here rather than 16 bytes/texel! Using R32G32B32A32_FLOAT for the HDR buffers would be practically unheard of. R16G16B16A16_FLOAT is fine for almost everything you would want to do and R11G11B10_FLOAT is commonly used for even most AAA titles.
  14. R&D Tile based particle rendering

    GroupMemoryBarrierWithGroupSync() is the one you'll see 99% of the time. It blocks all threads in the thread group from executing any further until all threads have finished accessing LDS and hit that instruction. It's essentially a cross-wave synchronisation point. A typical pattern would be:

      1. All threads write to LDS.
      2. GroupMemoryBarrierWithGroupSync().
      3. All threads read from LDS.
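That pattern as a minimal compute shader — the work itself is illustrative, only the write/barrier/read ordering matters:

```hlsl
RWStructuredBuffer<uint> Output;
groupshared uint gsData[64];

[numthreads(64, 1, 1)]
void CSMain(uint gi : SV_GroupIndex, uint3 groupId : SV_GroupID)
{
    // 1. All threads write to LDS.
    gsData[gi] = gi * 2;

    // 2. No thread proceeds until every thread in the group has hit this point.
    GroupMemoryBarrierWithGroupSync();

    // 3. All threads can now safely read slots written by other threads.
    Output[groupId.x * 64 + gi] = gsData[63 - gi];
}
```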
  15. R&D Tile based particle rendering

    It seems reasonable to imagine it /could/ change in the future (or may even have already changed?), but that's up to the hardware, it's not a D3D/HLSL thing. No idea what the behaviour is on other IHVs. I don't think the size of a thread group has much bearing on being able to hide memory latency per se. Obviously you want to make sure the hardware has enough waves to be able to switch between them; 4 per SIMD (16 per CU) is a reasonable target to aim for. But whether that's 16 waves from 16 different thread groups or 16 waves from a single thread group doesn't matter too much, so long as enough of them are making forward progress and executing instructions while others wait on memory. You don't want to be writing a 1024-thread thread group (16 waves on AMD) where one wave takes on the lion's share of the work while the other 15 sit around stalled on barriers - that's not going to help you hide latency at all. There's nothing inherently wrong with larger thread groups; you just need to be aware of how the waves get scheduled and ensure that you don't have too many waves sitting around doing nothing.