
Adam Miles

Member Since 09 Jul 2013

#5315886 How to understand GPU profiler data and use it to trace down suspicious abnor...

Posted by Adam Miles on 19 October 2016 - 06:29 PM

Buffer to Buffer copy is probably the best test for measuring bandwidth, but it doesn't represent the best layout for accessing data that is logically two or three dimensional, so stick to textures for that.


There really isn't a good one-estimate-fits-all-hardware approach to knowing how long things should take. The difference alone between 720p and 1080p is 2.25x and the hardware disparity between a reasonable mobile GPU and a high end discrete GPU is > 10x. So depending on whether you're talking about a Titan X rendering at 720p or a mobile GPU rendering at 1080p you could be talking about a 20x difference in GPU time.


I have a pretty good idea of how long typical tasks should take on Xbox One at common resolutions (720p, 900p, 1080p), but that just comes from years of looking at PIX captures of AAA titles from the best developers. If you asked me how long X should take on hardware Y at resolution Z, I'd probably start with the numbers I know from Xbox One, divide them by however much faster I think the hardware is and then multiply up by the increased pixel count.
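
As a rough sketch of that back-of-the-envelope scaling (the function and the numbers in it are purely illustrative, not measurements):

// Scale a timing measured on a reference GPU to guessed target hardware and
// resolution. All inputs are assumptions, so treat the output the same way.
double EstimateGpuTimeMs(double referenceTimeMs,   // measured on the reference GPU
                         double hardwareSpeedup,   // how much faster the target GPU is believed to be
                         double referencePixels,   // e.g. 1280.0 * 720.0
                         double targetPixels)      // e.g. 1920.0 * 1080.0
{
    return referenceTimeMs / hardwareSpeedup * (targetPixels / referencePixels);
}

// Example: a 2.0ms pass at 720p, estimated on hardware assumed to be 4x faster
// but rendering at 1080p (2.25x the pixels): 2.0 / 4.0 * 2.25 = 1.125ms.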


It doesn't hurt to try to figure out how close you might be coming to the various hardware limits a GPU has, just to see if you're approaching any of them. Metrics like vertices/second, fill-rate, texture fetch rate, bandwidth, TFLOPS etc. are all readily available for AMD/NVIDIA cards. The only tricky one to work out is how many floating-point operations your shader might cost per pixel/thread, as you don't always have access to the raw GPU instructions (at least on NVIDIA cards you don't); you can approximate it from the DXBC though.

#5315874 How to understand GPU profiler data and use it to trace down suspicious abnor...

Posted by Adam Miles on 19 October 2016 - 03:47 PM

You may be right about the 8MB write if your mobile GPU has dedicated VRAM. I'm used to thinking of mobile GPUs as sharing system memory with the CPU, in which case you would have to count both.


I was using this page for my DDR3-800 bandwidth numbers. I took the triple channel memory number, divided by 3 and multiplied by 2 to get back to Dual Channel.


If your GPU definitely has dedicated VRAM and it's writing into it then perhaps you're at only around 50% of the theoretical mark you might expect. That too might not be that surprising given that the tiling mode (i.e. 'layout') of the source memory is surely linear. GPUs will often only be able to hit peak throughput on things like texture fetching / bandwidth when the memory access is swizzled/tiled in such a way that it hits all memory banks exactly as intended.


Have you tried doing a raw Buffer to Buffer copy between your UPLOAD heap and the DEFAULT heap just to test how long it takes?
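
If it helps, here's a minimal sketch of timing such a copy with GPU timestamp queries. It assumes you already have an UPLOAD buffer, a DEFAULT buffer of the same size, a two-slot timestamp query heap and a READBACK buffer to resolve into.

// Records a GPU-timed Buffer->Buffer copy. After execution, read the two
// UINT64 timestamps back and divide their difference by the value returned
// from ID3D12CommandQueue::GetTimestampFrequency() to get seconds.
void RecordTimedCopy(ID3D12GraphicsCommandList* cmdList,
                     ID3D12Resource* defaultBuffer,    // destination (DEFAULT heap)
                     ID3D12Resource* uploadBuffer,     // source (UPLOAD heap)
                     UINT64 byteCount,
                     ID3D12QueryHeap* timestampHeap,   // D3D12_QUERY_HEAP_TYPE_TIMESTAMP, 2 slots
                     ID3D12Resource* readbackBuffer)   // READBACK heap, at least 2 * sizeof(UINT64)
{
    cmdList->EndQuery(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP, 0);
    cmdList->CopyBufferRegion(defaultBuffer, 0, uploadBuffer, 0, byteCount);
    cmdList->EndQuery(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP, 1);
    cmdList->ResolveQueryData(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP,
                              0, 2, readbackBuffer, 0);
}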

#5315866 How to understand GPU profiler data and use it to trace down suspicious abnor...

Posted by Adam Miles on 19 October 2016 - 02:23 PM

My maths gives rather different numbers.


800MHz * 8 bytes per clock = 6.4GB/s.

Double that for dual channel memory = 12.8GB/s


You need to read 8MB of memory and write 8MB of memory, so that's 16MB total.


At 12.8GB/s you can move 12.8MB per millisecond. So already it's clear that 16MB will take longer than 1ms to move.


Even if you managed to hit 100% of the theoretical memory transfer speed (unlikely), it would still take at least 1.25ms. I'd say you're not too far away from exactly the correct speed.

#5314885 Linear Sampler for Texture3D

Posted by Adam Miles on 12 October 2016 - 01:29 PM

The hardware will be doing Trilinear filtering and taking the weighted average of the 8 texels you'd expect.


I'm not sure if your algorithm is sensitive to this, but be aware that there is a finite number of values a texture coordinate can take between two texels; far fewer than you'd expect. I believe you're only guaranteed 8 bits of sub-texel precision, meaning there can only be 256 different possible linear interpolations between two adjacent texels in a 1D texture. Think of your texture coordinate as getting snapped to the nearest 1/256th of a texel and ask yourself whether that might be a problem.
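
As a rough sketch of what that snapping looks like (assuming the minimum guaranteed 8 bits of sub-texel precision; real hardware may use more):

#include <cmath>

// Simulates snapping the fractional part of a 1D texture coordinate (in texel
// units) to the nearest 1/256th before the linear interpolation is performed.
float QuantisedLerp(float texelA, float texelB, float fraction)
{
    const float steps = 256.0f;                            // 8 bits of sub-texel precision
    float snapped = std::round(fraction * steps) / steps;  // only 256 possible weights
    return texelA + (texelB - texelA) * snapped;
}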


I'm pretty sure hardware is allowed to give the interpolation more precision than that, but not less. Have you tried any GPUs other than the one you're using? Perhaps WARP?


Clearly point sampling the 8 values and doing your own interpolation will yield as precise a result as is possible with the numbers you have available, so this may go some way towards explaining why it works without artifacts.

#5313767 What's the benefit of using specific resource format vs using DXGI_FORMAT...

Posted by Adam Miles on 03 October 2016 - 01:41 PM

Just a small correction: 16-bit SNORM will have a step size of 1/32767, because only half the bit patterns represent 0.0f to 1.0f (and the other half -1.0f to 0.0f). 0.0f is represented only once, and both -32768 and -32767 represent -1.0f. This has the benefit of ensuring a symmetrical set of possible values either side of 0.0f; without it you'd have steps of 1/32768 on the negative side and 1/32767 on the positive side.
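
For reference, a small sketch of roughly how the D3D conversion rules decode a 16-bit SNORM value:

#include <algorithm>
#include <cstdint>

// Divide by 32767 and clamp, so both -32768 and -32767 decode to -1.0f and the
// representable values are symmetrical about 0.0f.
float Snorm16ToFloat(int16_t value)
{
    return std::max(static_cast<float>(value) / 32767.0f, -1.0f);
}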

#5313005 Stencil Write with Clip/Discard

Posted by Adam Miles on 28 September 2016 - 05:27 AM


Just before the entry point is fine, i.e.:

[earlydepthstencil]
float4 main() : SV_TARGET
{
   return 0;
}

Thanks for such a quick reply! How do I enable ReZ on PC? (Sorry for being such a greedy asker, but I can't find any related attributes to try.)

Big thanks again.


ReZ cannot be enabled or requested by anything you have control over in DirectX. Whether it's used or not is up to AMD's driver.

#5312942 Stencil Write with Clip/Discard

Posted by Adam Miles on 27 September 2016 - 07:34 PM

Just before the entry point is fine, i.e.:

[earlydepthstencil]
float4 main() : SV_TARGET
{
   return 0;
}

#5312940 [DX12] Texture2DArray descriptor used in shader as Texture2D

Posted by Adam Miles on 27 September 2016 - 07:09 PM

A similar question was asked before regarding NULL descriptors and whether Texture2D descriptors were necessarily compatible with Texture2DArray descriptors. The answer was no, they're not. Will it work today? Probably. Will it always work? Maybe. Should you do it? No!


If you want to always create your descriptors as Texture2DArray descriptors, why not declare the resource as a Texture2DArray in HLSL too and explicitly sample from slice 0?
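
For example, a minimal sketch of creating such a descriptor (assuming an existing device, texture and CPU descriptor handle; the format is just a placeholder):

D3D12_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
srvDesc.Format                         = DXGI_FORMAT_R8G8B8A8_UNORM;  // whatever your texture uses
srvDesc.ViewDimension                  = D3D12_SRV_DIMENSION_TEXTURE2DARRAY;
srvDesc.Shader4ComponentMapping        = D3D12_DEFAULT_SHADER_4_COMPONENT_MAPPING;
srvDesc.Texture2DArray.MostDetailedMip = 0;
srvDesc.Texture2DArray.MipLevels       = 1;
srvDesc.Texture2DArray.FirstArraySlice = 0;
srvDesc.Texture2DArray.ArraySize       = 1;   // a single-slice "array"
device->CreateShaderResourceView(texture, &srvDesc, cpuDescriptorHandle);

On the HLSL side the resource is then declared as a Texture2DArray and sampled with the slice index (0) in the third coordinate.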


The team have told me in the past that GBV (GPU-Based Validation, added in RS1) will validate descriptors against their intended SRV_DIMENSION, so if you enable GBV you should get a validation error.

#5311378 DirectX 12 Multi Threading

Posted by Adam Miles on 19 September 2016 - 05:04 AM

OK, thanks, that makes sense.


What about sharing resources? Is it safe to call ID3D12Resource->GetGPUVirtualAddress() on more than one thread to use that same resource on different threads?


That's fine as well.
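
For example, something like this is legal (a sketch; it assumes each thread records into its own command list and allocator, which D3D12 requires anyway, and that root parameter 0 is a CBV):

#include <thread>

// GetGPUVirtualAddress only reads an immutable property of the resource, so
// calling it concurrently from multiple recording threads is safe.
void RecordWithSharedResource(ID3D12GraphicsCommandList* cmdList, ID3D12Resource* constantBuffer)
{
    cmdList->SetGraphicsRootConstantBufferView(0, constantBuffer->GetGPUVirtualAddress());
    // ... record the rest of this thread's commands ...
}

void RecordInParallel(ID3D12GraphicsCommandList* listA,
                      ID3D12GraphicsCommandList* listB,
                      ID3D12Resource* sharedConstantBuffer)
{
    std::thread threadA(RecordWithSharedResource, listA, sharedConstantBuffer);
    std::thread threadB(RecordWithSharedResource, listB, sharedConstantBuffer);
    threadA.join();
    threadB.join();
}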

#5299371 Compute shader and nr of threads, typo or misunderstanding?

Posted by Adam Miles on 06 July 2016 - 02:22 PM

I believe it's a typo, yes.


Everything else you said is correct, perhaps with the exception of:


- a multiprocessor can handle x thread groups, to fully use available computing power, create x*2 groups, so a stalled multiprocessor can fall back to the other thread group. With 16 multiprocessors, this would be (max) 32 thread groups

- shared memory can be max 32kb, so 16kb per thread group, because if you have 2 per multiprocessor, there wouldn't be enough with >2*16kb


The calculation isn't as simple as creating x * 2 thread groups in order to fully utilise the GPU. Ideally you'd create a lot more than 2x as many threads as the GPU has processors, to give the GPU's scheduler the best possible opportunity to switch to another wave (or thread group) and keep issuing work. An AMD GPU can handle 10 'waves' of work per "SIMD" in their terminology. More threads is generally better; trust the GPU to schedule them properly. It's not like writing CPU code, where creating too many threads can overwhelm the OS' scheduler.
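
As a trivial sketch of sizing a dispatch that way (the numbers and the commandList variable are illustrative; in D3D11 the Dispatch call lives on the device context instead):

// Ceiling-divide the workload by the thread-group size; on a 16-multiprocessor
// GPU this produces far more than 32 groups, which is exactly what you want.
const UINT threadsPerGroup = 64;        // matches [numthreads(64, 1, 1)] in the shader
const UINT elementCount    = 1000000;   // hypothetical workload size
const UINT groupCount      = (elementCount + threadsPerGroup - 1) / threadsPerGroup;
commandList->Dispatch(groupCount, 1, 1);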


Regarding shared memory, it is true that each thread group can only address 32KB of it at once. However, there's nothing to say the GPU doesn't have a lot more than 32KB of shared memory per "multiprocessor" (aka Compute Unit in GCN speak). GCN GPUs have 64KB per CU, so they can run two thread groups using 32KB each simultaneously. There's no reason future cards might not have even more (128KB, say), and in doing so they could run more shared-memory-hungry thread groups at once. Try to keep your use of shared memory to a minimum because it is a scarce resource, but just because each thread group can only address 32KB doesn't *necessarily* mean each "multiprocessor"/CU only has 32KB.

#5298100 DXT Texture compression for DX11

Posted by Adam Miles on 26 June 2016 - 06:50 AM

You answered your own question in the thread title: use DXT compression.


If SlimDX's ToFile function doesn't support compressing to DXT, do it offline using DirectXTex. Texconv is a tool that can convert/compress uncompressed DDS textures to other compressed formats within the DDS container, or you can use the DirectXTex library directly and write your own tool if you wish.
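
If you go the library route, a minimal sketch using DirectXTex might look like the following (error handling trimmed; it assumes COM/WIC has been initialised, e.g. via CoInitializeEx, and the file names are placeholders):

#include <DirectXTex.h>
using namespace DirectX;

HRESULT CompressToDDS(const wchar_t* inputFile, const wchar_t* outputFile)
{
    TexMetadata metadata;
    ScratchImage source;
    HRESULT hr = LoadFromWICFile(inputFile, WIC_FLAGS_NONE, &metadata, source);
    if (FAILED(hr)) return hr;

    ScratchImage compressed;
    hr = Compress(source.GetImages(), source.GetImageCount(), source.GetMetadata(),
                  DXGI_FORMAT_BC3_UNORM,            // BC3 == DXT5
                  TEX_COMPRESS_DEFAULT, TEX_THRESHOLD_DEFAULT, compressed);
    if (FAILED(hr)) return hr;

    return SaveToDDSFile(compressed.GetImages(), compressed.GetImageCount(),
                         compressed.GetMetadata(), DDS_FLAGS_NONE, outputFile);
}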

#5296980 [D3D12] Targeting a laptops Nvidia and Integrated chip

Posted by Adam Miles on 17 June 2016 - 11:03 AM

What you're talking about doing is writing an application that uses the Multi-Adapter functionality added to D3D12. Specifically, since you mentioned one NVIDIA and one Intel GPU, it's Heterogeneous Multi-Adapter (two or more GPUs of different designs).


There's a Heterogeneous Multi-adapter sample here: https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12HeterogeneousMultiadapter


There is no switch you can flip that will make this 'just work'; it needs to be thought about and designed into the application. I don't think MiniEngine has any multi-adapter code in it yet, although it wouldn't surprise me if it were added some time in the future.
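
As a starting point, a minimal sketch of enumerating the adapters and creating a device on each (error handling omitted; which index ends up being the NVIDIA or the Intel GPU varies per machine):

#include <dxgi1_4.h>
#include <d3d12.h>
#include <wrl/client.h>
#include <vector>
using Microsoft::WRL::ComPtr;

std::vector<ComPtr<ID3D12Device>> CreateDevicePerAdapter()
{
    ComPtr<IDXGIFactory4> factory;
    CreateDXGIFactory1(IID_PPV_ARGS(&factory));

    std::vector<ComPtr<ID3D12Device>> devices;
    ComPtr<IDXGIAdapter1> adapter;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i)
    {
        DXGI_ADAPTER_DESC1 desc;
        adapter->GetDesc1(&desc);
        if (desc.Flags & DXGI_ADAPTER_FLAG_SOFTWARE)
            continue;                                   // skip WARP / software adapters

        ComPtr<ID3D12Device> device;
        if (SUCCEEDED(D3D12CreateDevice(adapter.Get(), D3D_FEATURE_LEVEL_11_0,
                                        IID_PPV_ARGS(&device))))
            devices.push_back(device);
    }
    return devices;
}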

#5296856 gui rendering issue in dx12

Posted by Adam Miles on 16 June 2016 - 03:41 PM

I'm not seeing any API calls to set the Scissor Rect; it'll default to a zero-area scissor if you don't set one.
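
For reference, a minimal sketch of setting one (renderTargetWidth/renderTargetHeight stand in for whatever your viewport covers):

// D3D12 has no "scissor disabled" state; set a rect covering the render target.
D3D12_RECT scissorRect = { 0, 0, static_cast<LONG>(renderTargetWidth),
                                 static_cast<LONG>(renderTargetHeight) };
commandList->RSSetScissorRects(1, &scissorRect);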

#5296476 Deferred Context Usage

Posted by Adam Miles on 14 June 2016 - 08:19 AM

Unless "m_List" is a ComPtr or some other 'smart' type that calls ->Release for you when you assign it to a new value, then you're probably leaking command lists. Are you calling Release on your ID3D11CommandLists?

#5296420 [D3D12] How to correctly update constant buffers in different scenarios.

Posted by Adam Miles on 13 June 2016 - 08:48 PM

Note that one draw call doesn't necessarily count as one "read" of a cbuffer.

e.g. on AMD, the draw call is broken up into thread-groups of 64 pixels, and each of those thread-groups will read the cbuffer values from memory into their SGPR's.


Bear in mind, though, that that 'memory' is still cached by the GPU's normal cache hierarchy; in the case of constants, the K$ ("Constant Cache"). So not every wave will hit system memory and the PCI-E bus; hopefully only the first wave will cause the fetch to occur. Future waves (and even draws) may continue to hit the cached values until they are evicted or have to be flushed for correctness.